Great start on Word2Vec, but still a few questions

Recently I started using org.deeplearning4j.examples.advanced.modelling.embeddingsfromcorpus.word2vec to train on a recent Dutch Wikipedia text dump (2 GB), using a GPU and Eclipse. I tested the trained word2vec model (4.5 GB), and the results are quite promising, I must say. I do have a couple of questions though, on a) the training-set requirements (the Wikipedia text dump); b) the training parameters; and c) testing the trained network. I read https://deeplearning4j.konduit.ai/language-processing/word2vec, but couldn't find the answers there, so I hope somebody here can answer them. Thanks a lot!

Questions:
1. Training-set requirements:
a. I did a lot of replacements and parsing tricks on the Wikipedia text dump to get as many fluent sentences as possible. Looking at the code, it seems that the word2vec window moves over the lines (instead of over the sentences). Does this mean that the training set should have one sentence per line (like in the example file raw_sentences.txt)? If so, that is almost impossible to achieve, or am I missing something?
b. Are there any other requirements for the dataset?

2. Training parameters (settings of Word2Vec.Builder).
My main question is: what determines when the training process finishes? Is it the number of epochs? I've set it to 1. Does that mean that the whole training set is fed to the neural net exactly once? If so, does '1' make sense? Shouldn't the training set be fed multiple times, and if so, how many times?
Or does the learning rate determine when training finishes? I see in the log that the learning rate slowly decreases during training. In my case training stopped when the value was exactly 1.0E-4, which made me think that the current learning-rate value might be a stopping criterion. (See also the sketch below my current settings.)
These are the settings I’m using at the moment:
Word2Vec vec = new Word2Vec.Builder()
        .minWordFrequency(5)
        .iterations(1)
        .layerSize(300)
        .windowSize(9)
        .iterate(iter)
        .tokenizerFactory(t)
        .build();
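
To make that explicit, I was considering something along the lines of the sketch below. Note that I'm only assuming that Word2Vec.Builder has epochs(), learningRate() and minLearningRate() setters with these exact names, and that the 1.0E-4 from my log is just the floor the learning rate decays to rather than a stopping criterion; please correct me if that's wrong:

Word2Vec vec = new Word2Vec.Builder()
        .minWordFrequency(5)
        .iterations(1)            // what I have now; unclear to me how this relates to epochs
        .epochs(1)                // assumed setter: explicit number of passes over the whole corpus
        .learningRate(0.025)      // assumed setter: starting learning rate (placeholder value)
        .minLearningRate(1.0E-4)  // assumed setter: floor the rate decays to (matches my log)
        .layerSize(300)
        .windowSize(9)
        .iterate(iter)
        .tokenizerFactory(t)
        .build();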

3. Testing the trained network
I used the following code to run the trained network:
word2Vec.wordsNearestSum(str, 10);
That works perfectly: it returns at most 10 associations of the String str. But there are two situations in which the wordsNearestSum function crashes:
a. When str contains more than one word (separated by spaces).
Is there a way to feed multiple words to the trained network? It should be possible, because the article on konduit.ai mentioned above shows the result for 'New York Times', which consists of three words.
b. When str contains a word that the model doesn't recognize.
Is there a way to make the function not crash, but return null or something instead?

Question 3 has been resolved; questions 1 and 2 are still open.

Nobody? Is the deeplearning4j community still active, I wonder? I hope so! Thanks!

@ASadon if you want guaranteed response times, I can point you to our commercial support page. We get to things when we can.

  1. This is relative to the tokenizer factory you use. There are different ones, and you can even implement your own. The default one just splits on spaces (see the sketch after this list).

  2. An epoch is generally one pass through the dataset; typically only 1 is required, but you can do more. The “iterations” you set there act as the epochs.

  3. The crash depends on the code you’re running; I’m not sure what your issue there is exactly, since the code that actually produces the “crash” you’re referencing isn’t shown in the question.
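
To make 1 and 2 concrete, here is a rough sketch of the usual setup (essentially what the raw-text example in the examples repo does); the file path is a placeholder and the parameter values are just the ones you posted, not recommendations:

import org.deeplearning4j.models.word2vec.Word2Vec;
import org.deeplearning4j.text.sentenceiterator.BasicLineIterator;
import org.deeplearning4j.text.sentenceiterator.SentenceIterator;
import org.deeplearning4j.text.tokenization.tokenizer.preprocessor.CommonPreprocessor;
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;

public class Word2VecSetupSketch {
    public static void main(String[] args) throws Exception {
        // Each line of the file is treated as one "sentence"; the iterator does not split sentences itself.
        SentenceIterator iter = new BasicLineIterator("/path/to/corpus.txt");

        // Default tokenizer splits on whitespace; the preprocessor lowercases and strips punctuation.
        // Plug in your own TokenizerFactory implementation if Dutch needs different handling.
        TokenizerFactory t = new DefaultTokenizerFactory();
        t.setTokenPreProcessor(new CommonPreprocessor());

        Word2Vec vec = new Word2Vec.Builder()
                .minWordFrequency(5)
                .iterations(1)        // one pass is usually enough
                .layerSize(300)
                .windowSize(9)
                .iterate(iter)
                .tokenizerFactory(t)
                .build();

        vec.fit();                    // runs the configured passes over the corpus, then returns
    }
}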

Sorry for the delay, but just a reminder that the questions here are answered by real people whose jobs involve more than managing this forum.


It is still active, but as @agibsonccc said, the free support on the forums is on a best-effort basis, and sometimes it can take a while until someone who is qualified enough to answer a question has the time for it.

Now to elaborate on the answers you’ve received already:

  1. Because what counts as a “sentence” differs between languages and even between problem settings, we only provide the basic version for you.
  2. The parameters you are asking about are all “hyperparameters”, i.e. they require some tuning to get the best results for your application.
  3. a) That only works if those three words are considered to be a single token. Again, what is or is not a token very much depends on the language and on the problem you are trying to solve (see the sketch right after this list).
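
If you do want to query with several separate words, or avoid the crash on words the model has never seen (your 3 b), a defensive helper might look roughly like the sketch below. I'm going from memory here and assuming that hasWord() exists for the out-of-vocabulary check and that wordsNearestSum() has an overload taking collections of positive and negative words, so double-check against the version you are using:

import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

import org.deeplearning4j.models.word2vec.Word2Vec;

public class QuerySketch {
    // Returns up to 'top' nearest words, silently skipping query words the model does not know.
    static Collection<String> nearestOrEmpty(Word2Vec word2Vec, String query, int top) {
        List<String> known = new ArrayList<>();
        for (String token : query.split("\\s+")) {
            if (word2Vec.hasWord(token)) {       // guard against out-of-vocabulary words
                known.add(token);
            }
        }
        if (known.isEmpty()) {
            return Collections.emptyList();      // nothing usable in the query: return empty instead of crashing
        }
        // Collection overload: query with several words at once, e.g. "new york times"
        return word2Vec.wordsNearestSum(known, Collections.<String>emptyList(), top);
    }
}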

As you can see, natural-language problems aren't quite as clear-cut as others (e.g. when dealing with pictures, a pixel is a pixel). Because Word2Vec has quite a few additional intrinsic problems, it has fallen out of fashion with modern models, and where it is still used, most people just use pretrained word vectors.

These days w2v-compatible word vectors, along with information about the tokenization used for them, are available for many languages, as you are obviously aware, since you've already posted the link to one such repository yourself in a different thread.
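
If you go the pretrained route, loading such vectors into DL4J is typically a one-liner. The path below is a placeholder, and readWord2VecModel() is the loader I'd expect to handle the standard word2vec .bin/.txt formats; check WordVectorSerializer in your version to be sure:

import java.io.File;

import org.deeplearning4j.models.embeddings.loader.WordVectorSerializer;
import org.deeplearning4j.models.word2vec.Word2Vec;

public class LoadPretrainedSketch {
    public static void main(String[] args) throws Exception {
        // Load pretrained word2vec-format vectors from disk (path is a placeholder).
        Word2Vec pretrained = WordVectorSerializer.readWord2VecModel(new File("/path/to/dutch-vectors.bin"));

        // The same query API applies afterwards.
        System.out.println(pretrained.wordsNearest("amsterdam", 10));
    }
}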
