Extending ParagraphVectors vocab and re-fitting a model

Hi. I’m having a problem while trying to extend an existing ParagraphVectors model with a new set of labelled documents and re-fitting it.
I’m using version 1.0.0-beta7 of ND4J and DL4J.
The idea is to allow the model to be extended whenever a new set of labelled documents needs to be processed, so that the vocabulary and lookup table stay in sync. I couldn’t find a ready-to-use DL4J solution (I looked through the source code, but nothing fits), so I decided to write my own implementation. After rebuilding the vocab using SequenceVectors.buildVocab() and creating a new lookup table (for the new, extended vocab) with the same config as the existing one, I fit the model and get “Process finished with exit code -1073741819 (0xC0000005)” in my IDE with no stack trace. It happens every time (it is not intermittent). Does anyone have an idea what could be causing this? Debugging shows that the error is thrown directly in Nd4jCpu.java, in the method execCustomOp2(), which is called from SkipGram.iterateSample(List<BatchItem> items) (SkipGram.java, line 534).

Thanks in advance!

This might help:

That exit code generally indicates a native crash inside the JVM process rather than a Java exception. If this doesn’t work for you, could you look for an hs_err_pid<pid>.log file (the JVM's crash log) in the folder where you ran the project?

@agibsonccc thanks a lot for the hint about this example and the error logs!

I’ve analyzed the example and unfortunately it doesn’t do what I need. It re-fits the model using the same vocab and, correspondingly, the same lookup table. My case is extending the vocab step by step: I don’t have the computational resources to process 10 million documents at once, and I retrieve them from a server via a REST API whose payload size is also limited. Extending the vocab also requires extending the lookup table.
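
To illustrate why the two have to grow together, here is a minimal sketch in plain Java. It is a toy model, not DL4J code: the class ToyEmbeddings and its fields are hypothetical, and new rows are zero-initialized only to keep the example deterministic. The point is simply that every vocabulary index needs a row in the table, and that rows for existing words must survive the extension.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy illustration only: hypothetical class, unrelated to DL4J's own types.
class ToyEmbeddings {
    final Map<String, Integer> vocab = new LinkedHashMap<>(); // word -> row index
    float[][] table = new float[0][];                         // one row per word
    final int dim;                                            // vector length

    ToyEmbeddings(int dim) {
        this.dim = dim;
    }

    // Adds unseen words to the vocab, then grows the table to match.
    void extend(List<String> words) {
        for (String w : words) {
            // Known words keep their index; new ones get the next free index.
            vocab.putIfAbsent(w, vocab.size());
        }
        // copyOf keeps the existing rows (learned weights) and pads with null.
        float[][] grown = Arrays.copyOf(table, vocab.size());
        for (int i = table.length; i < grown.length; i++) {
            grown[i] = new float[dim]; // zero-initialized row for a new word
        }
        table = grown;
    }
}
```

For example, after `extend(Arrays.asList("cat", "dog"))` followed by `extend(Arrays.asList("dog", "bird"))`, the vocab holds three words, the table has three rows, and whatever was stored in the rows for "cat" and "dog" is unchanged. If the table were not grown, the index for "bird" would point past the last row.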

Regarding the native JVM crash - there are no hs-err.log files for this case. Do I need to start my JVM with special options in order to get those?
Also, is there a way to debug native method execution for ND4J operations? I’d really like to find out what’s missing in order to get this working.

No, there isn’t anything special you need. It should work by default. You can enable a debug mode in ND4J to get it to output more debugging information, though.
You can do that with:

        Nd4j.getEnvironment().setDebug(true);

You can also add new vocab words using the same word2vec uptraining. Word2vec uptraining just involves having a set of word vectors for each word you want in your vocab. Just set a proper sentence iterator as you would for anything else, and new words should get added automatically. Do you mind trying that and seeing if you run into any issues?

I tried it - still no useful info on the console and no error logs in the project folder.
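
One thing worth ruling out when no crash log turns up: HotSpot names its fatal-error log hs_err_pid<pid>.log and by default writes it to the JVM's working directory, which under an IDE is not necessarily the project folder. The sketch below (the class name CrashLogFinder is made up for illustration) lists such files in a given directory using only the standard library.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: looks for HotSpot fatal-error logs in one directory.
class CrashLogFinder {
    // Returns files named hs_err_pid*.log directly inside dir (non-recursive).
    static List<Path> find(Path dir) throws IOException {
        List<Path> logs = new ArrayList<>();
        try (DirectoryStream<Path> stream =
                 Files.newDirectoryStream(dir, "hs_err_pid*.log")) {
            for (Path p : stream) {
                logs.add(p);
            }
        }
        return logs;
    }
}
```

Calling `CrashLogFinder.find(Paths.get(System.getProperty("user.dir")))` from the same run configuration (with `java.nio.file.Paths` imported) checks the directory the JVM actually started in. The log location can also be set explicitly with the JVM option `-XX:ErrorFile=<path>`.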

I tried it with the corresponding modifications for ParagraphVectors and made the following observations:

  1. If the ParagraphVectors builder does not explicitly set resetModel to false, the vocabulary is not extended and execution succeeds.
  2. If the ParagraphVectors builder has resetModel=false, the vocabulary is extended, and the next call to fit() after the extension causes the native crash described above.
  3. Saving the model to a file and loading it back (as in the example) has no impact on the behavior.
  4. Updating an existing model with a new iterator (which points to new sequences) and running fit() with resetModel=false doesn’t extend the vocabulary, and there is no crash.
  5. Updating an existing model with a new iterator (which points to new sequences), explicitly calling buildVocab() to extend the vocabulary, and then running fit() causes a crash.

So I played with different combinations and noticed that recreating the model using the same builder that created it, but changing the sequence iterator and rebuilding the vocabulary in a specific way, gets me where I need to be. And I got a successful result!
I had to re-implement the vocabulary building because the fit() method doesn’t call buildVocab() if you don’t want to reset the model. I also noticed that VocabConstructor.buildJointVocabulary() doesn’t mark tokens that are already in the vocabulary as labels. So I fixed this part and it all worked: all the labels and all the new words were there. The lookup table was not reset, all labels were preserved, and the vocabulary was extended. The main difference between the example you referenced and my implementation is that I recreate the model from scratch using the same builder that created it, and afterwards I call

  ((InMemoryLookupTable<VocabWord>) paragraphVectors.getLookupTable())
                            .consume(oldWeightLookupTable); 

And it works! I tried the builder without this line and got the crash. I also tried updating the existing model with a new iterator, as in the example, together with this line, but it still crashed. Only recreating the model using the builder (which itself preserves the lookup table and the vocabulary) together with consuming its own lookup table works. This suggests that the problem lies in one of the ParagraphVectors fields that is re-initialized by the builder’s build() call; re-initializing it is what prevents the crash during the next fit() after the vocabulary extension. I couldn’t identify exactly which field or fields are crucial here, but if they are not re-initialized (as in the example you provided), the fit() after the vocabulary extension causes the crash.