Recently I started using org.deeplearning4j.examples.advanced.modelling.embeddingsfromcorpus.word2vec to train on a recent Dutch Wikipedia text dump (2 GB), using a GPU and Eclipse. I tested the trained word2vec model (4.5 GB), and the results are quite promising, I must say. I have a couple of questions though, on a) the training set requirements (the Wikipedia text dump); b) the training parameters; and c) testing the trained network. I read https://deeplearning4j.konduit.ai/language-processing/word2vec, but couldn't find the answers there. I hope somebody can answer them, thanks a lot!
1. Training set requirements:
a. I did a lot of replacements and parsing tricks on the Wikipedia text dump to get as many fluent sentences as possible. Looking at the code, it seems the word2vec window moves over lines (rather than sentences). Does this mean the training set should have one sentence per line (like the example file raw_sentences.txt)? If so, that is almost impossible to do by hand, or am I missing something?
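For what it's worth, getting to one sentence per line can be automated; below is a minimal sketch using the JDK's locale-aware BreakIterator (my actual preprocessing is messier, and BreakIterator's sentence detection is only approximate, but it gets most of the way there):

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceSplitter {
    // Split a paragraph into sentences with the locale-aware BreakIterator,
    // so the corpus can be written out one sentence per line.
    static List<String> toSentences(String paragraph, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(paragraph);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String s = paragraph.substring(start, end).trim();
            if (!s.isEmpty()) sentences.add(s);
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "Dit is een zin. Dit is nog een zin! En een derde?";
        for (String s : toSentences(text, new Locale("nl"))) {
            System.out.println(s);
        }
    }
}
```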
b. Are there any other requirements for the dataset?
2. Training parameters (settings of Word2Vec.Builder).
My main question is: what determines when the neural net finishes the training process? Is it the value of epochs? I've set it to 1. Does that mean the whole training set is fed to the neural net exactly once? If so, does 1 make sense? Shouldn't the training set be fed multiple times, and if so, how many times?
Or does the learning rate determine when the training process finishes? I see in the log that the learning rate slowly decreases during training; in my case training stopped when the value was exactly 1.0E-4, which made me think the current learning rate value might be a stopping criterion.
These are the settings I’m using at the moment:
Word2Vec vec = new Word2Vec.Builder()
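(The rest of the builder chain got cut off above; the shape is roughly as follows — the values here are illustrative, close to what the DL4J examples use, and not necessarily my exact settings:)

```java
// Illustrative values only, not my exact configuration.
SentenceIterator iter = new BasicLineIterator(new File("nlwiki.txt")); // one sentence per line
TokenizerFactory t = new DefaultTokenizerFactory();
t.setTokenPreProcessor(new CommonPreprocessor());

Word2Vec vec = new Word2Vec.Builder()
        .minWordFrequency(5)     // ignore rare words
        .layerSize(300)          // embedding dimension
        .windowSize(5)
        .epochs(1)               // full passes over the corpus
        .learningRate(0.025)
        .minLearningRate(1e-4)   // floor the learning rate decays toward
        .seed(42)
        .iterate(iter)
        .tokenizerFactory(t)
        .build();
vec.fit();
```

(Note the minLearningRate of 1e-4 here, which matches the value I saw in the log when training stopped.)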
3. Testing the trained network
I used the following code to run the trained network:
That works perfectly: it returns at most 10 associations for the str String. But there are two situations in which the wordsNearestSum function crashes:
a. When str contains more than one word (separated by spaces).
Is there a way to feed multiple words to the trained network? It should be possible, because the above-mentioned article on konduit.ai shows the result for ‘New York Times’, which consists of three words.
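If I read the WordVectors interface correctly, the crash happens because the whole string is looked up as a single vocabulary token, which "New York Times" is not (unless phrases were merged into single tokens during preprocessing). There seems to be an overload that takes collections of words instead of a single token; a sketch of what I mean (I haven't verified this is how the konduit.ai article got its result):

```java
// Query with several words at once by passing collections rather than a
// single space-separated string (overload as I understand it from the
// WordVectors interface):
Collection<String> positive = Arrays.asList("new", "york", "times");
Collection<String> negative = Collections.emptyList();
Collection<String> nearest = vec.wordsNearestSum(positive, negative, 10);
```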
b. When str contains a word that the model doesn’t recognize.
Is there a way to make the function not crash, but return null or an empty result instead?
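For now I'm considering a wrapper along these lines, using hasWord to check the vocabulary first (assuming that check is the right way to detect out-of-vocabulary words):

```java
// Guard against out-of-vocabulary words instead of letting the lookup crash.
static Collection<String> safeNearest(Word2Vec vec, String word, int top) {
    if (!vec.hasWord(word)) {
        return null; // or Collections.emptyList(), whichever the caller prefers
    }
    return vec.wordsNearestSum(word, top);
}
```

But if there is a built-in way to get this behaviour, I'd rather use that.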