Beta 7 - Glove Word Vector

Hi

We were using DL4J beta6 to load a GloVe 300d word vector file. It stopped working after upgrading to beta7. I asked the same question on Gitter and was pointed to this commit:
Remove GloVe (#437) · eclipse/deeplearning4j@2cdb2b3 · GitHub.

It seems support for GloVe was removed in beta7. Is that the case?

However, the beta7 documentation shows that the file can be loaded using the loadTxtVectors method:

https://deeplearning4j.konduit.ai/language-processing/word2vec#glove-global-vectors

I did try that. It takes forever, even for the 50d file; it never finishes loading. I also tried a few other methods in the WordVectorSerializer class, but none of them worked.

Please suggest any workarounds.

Thanks.

We have removed support for creating new GloVe-based word vector files. However, once created, there is no difference between a word vector file produced by word2vec and one produced by GloVe, so loading them should not be affected.

Can you share more details about your system and how big your GloVe file is?

Hello,

Thanks for the quick response. I tried several different configurations; some are listed below:

  • Xeon, 4 cores, 32 GB RAM, JDK 8, Windows 10
  • Core i7, 4 cores with hyperthreading (8 logical cores), 16 GB RAM, JDK 8, Windows 10

File details: glove.6B.50d.txt (167 MB), glove.6B.100d.txt (338 MB), glove.6B.300d.txt (1 GB). All are plain text files.

If I switch back to beta6, the glove.6B.50d.txt file loads in about 5 seconds. In beta7 it doesn’t load at all. The program is just a single line:

WordVectors wordVectors = WordVectorSerializer.readWord2VecModel(new File("glove.6B.50d.txt"));

It looks like you have stumbled upon a bug in beta7.

As a workaround, you can load the file in beta6 and then save it in DL4J's binary format like this:

Word2Vec wordVectors = WordVectorSerializer.readWord2VecModel(new File("glove.6B.50d.txt"));
WordVectorSerializer.writeWord2VecModel(wordVectors, new File("glove.bin"));

And then you can reload the binary in beta7:

Word2Vec wordVectors = WordVectorSerializer.readWord2VecModel(new File("glove.bin"));

We have filed an issue for the underlying problem: ND4J: concat op fails with 3073 or more inputs on CPU · Issue #8961 · eclipse/deeplearning4j · GitHub

Edit: It appears to be a problem with the mkldnn implementation. If you don’t want to go back and forth between versions, you can also disable mkldnn by calling Nd4j.getEnvironment().allowHelpers(false); before loading, and then save the file in another format that way.
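
Putting that workaround together, here is a minimal sketch of doing the whole conversion in beta7 alone, with mkldnn helpers disabled before the load (assumes DL4J/ND4J beta7 on the classpath; the class name and file paths are illustrative):

```java
import java.io.File;

import org.deeplearning4j.models.embeddings.loader.WordVectorSerializer;
import org.deeplearning4j.models.word2vec.Word2Vec;
import org.nd4j.linalg.factory.Nd4j;

public class GloveConversionWorkaround {
    public static void main(String[] args) {
        // Disable mkldnn helpers so the buggy concat path is avoided (beta7 workaround)
        Nd4j.getEnvironment().allowHelpers(false);

        // Load the GloVe text file and re-save it in DL4J's binary format
        Word2Vec vectors = WordVectorSerializer.readWord2VecModel(new File("glove.6B.50d.txt"));
        WordVectorSerializer.writeWord2VecModel(vectors, new File("glove.bin"));
    }
}
```

After this one-time conversion, later runs can load glove.bin directly with readWord2VecModel, which should also be noticeably faster than re-parsing the text file each time.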