Loading model takes over two hours?!

I started playing around with deeplearning4j. I want to load a pretrained word2vec model (in Dutch) and do some tests. Loading the model (binary, 1 GB) took my system over two hours, which is way too long, I presume?
This is the code (in Eclipse). The line with the invoke statement takes over two hours:

Can anyone help me out? Can I speed things up? What I would like to build is a system in Java that creates word-associations from a starting word, in Dutch.

Thanks a lot!

There are a few things wrong here:

  1. Why are you using reflection?
  2. Your dependencies look weird. Why do you have an explicit dependency on nd4j-buffer? Why are you mixing CUDA versions?
  3. Where are you reading the data from? What kind of storage is it?
  4. Even though your pom.xml says beta6, are you entirely sure you aren’t somehow on beta7? For example, you may have tried to downgrade after seeing something like this: Beta 7 - Glove Word Vector - #4 by treo, but Eclipse didn’t properly pick up the change.

Typically loading a 1GB binary takes about as long as it takes to read the file - 2 hours obviously is way too long.
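
To get that baseline, you can time a plain sequential read of the model file using only the JDK. If this alone is slow, the storage is the bottleneck and not deeplearning4j. This is only a minimal sketch: the class name ReadBaseline and the helper drain are made up for illustration, and the model path is expected as the first command-line argument.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper class, not part of deeplearning4j.
public class ReadBaseline {

    // Reads the stream to its end and returns the number of bytes seen.
    static long drain(InputStream in) {
        byte[] buffer = new byte[1 << 16];
        long total = 0;
        try {
            int n;
            while ((n = in.read(buffer)) != -1) {
                total += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ReadBaseline <path-to-model-file>");
            return;
        }
        long start = System.nanoTime();
        long bytes;
        try (InputStream in = new BufferedInputStream(
                Files.newInputStream(Paths.get(args[0])))) {
            bytes = drain(in);
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.println(bytes + " bytes read in " + seconds + " s");
    }
}
```

Keep in mind that a second run may be served from the operating system's file cache, so the first run is the more honest number.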

If you can, running your application with a profiler should also shed some light on why it takes so long to load.

Hi Treo,

Thanks a lot for looking into my situation. My answers:

  1. What do you mean by ‘reflection’? And where am I using that?
  2. Mixed CUDA versions, you’re right. I checked my version and took 10.1 out. I also deleted the nd4j-buffer dependency.
  3. I’m reading from a file that I downloaded from the NLPL word embeddings repository (I tested both the Dutch .bin and .txt downloads, with the same long loading time). I assumed that I can also use these files with deeplearning4j. If not: do you know of a Dutch model that is suitable? I understand that the Google News model is English only.
  4. How can I check “where I’m on”?

NB. I also tested this line for loading the model, with the same effect:

Word2Vec word2vec = WordVectorSerializer.readWord2VecModel(modelFile);

The code you’ve shared in your original post uses reflection, i.e. getDeclaredMethod and invoke.
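
To illustrate the difference, here is a small sketch that calls the same method once through reflection and once directly. The Loader class and its load method are hypothetical stand-ins, not deeplearning4j classes; the point is only that the direct call does the same thing with less ceremony and with compile-time checking.

```java
import java.lang.reflect.Method;

public class ReflectionVsDirect {

    // Hypothetical stand-in for a loader class; not part of deeplearning4j.
    static class Loader {
        static String load(String path) {
            return "loaded:" + path;
        }
    }

    // Reflection: the method is looked up by name at runtime and then invoked.
    // A typo in the name or a wrong argument type only fails when this runs.
    static String viaReflection(String path) {
        try {
            Method m = Loader.class.getDeclaredMethod("load", String.class);
            return (String) m.invoke(null, path); // null target: static method
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Direct call: the compiler checks the name and the argument types.
    static String direct(String path) {
        return Loader.load(path);
    }

    public static void main(String[] args) {
        System.out.println(viaReflection("model.bin"));
        System.out.println(direct("model.bin"));
    }
}
```

Unless the method you need is not accessible any other way, the direct call is the one to use.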

As we can’t quite trust your pom.xml file here, you can check it at runtime with:

        System.out.println(VersionCheck.versionInfoString());

That will print all versions for you.

I’ve downloaded the Dutch word2vec-compatible file from the repository you’ve linked (http://vectors.nlpl.eu/repository/20/39.zip).

As I’m running version 1.0.0-beta7, I applied the workaround from the linked post (Beta 7 - Glove Word Vector - #4 by treo), loaded the txt file and timed it:

Nd4j.getEnvironment().allowHelpers(false);
long start = System.nanoTime();
Word2Vec word2vec = WordVectorSerializer.readWord2VecModel(new File("C:\\Users\\dubs\\Downloads\\39\\model.txt"));
long stop = System.nanoTime();
System.out.println("runtime = " + (stop - start) / 1e9);

This tells me that it took 82 seconds to load.

I repeated this with the binary model file (model.bin) and without the workaround, as that isn’t needed for binary model loading:

long start = System.nanoTime();
Word2Vec word2vec = WordVectorSerializer.readWord2VecModel(new File("C:\\Users\\dubs\\Downloads\\39\\model.bin"));
long stop = System.nanoTime();
System.out.println("runtime = " + (stop - start) / 1e9);

This took 57 seconds to load.
