Inputting only positive words to word2vec#wordsNearest in the prediction step

Hi there,

I am copying my question from StackOverflow here, hoping for a more definitive answer:

Kindly forgive that I am a newbie to word embeddings in general.

Sometimes I have not just a single word but a list of words (a sentence) for which I want to predict similarity. All the word2vec examples in deeplearning4j train on a list of words but predict with a single word.
I know word2vec is capable of complex logical inferences, like finding what is most similar to “king” plus “woman” but minus “man”, which gives “queen”. But in my case, I assume it is just addition: the word most similar to “a big cat” would be “tiger”.

I tried:

INDArray wordVector1 = wordVectors.getWordVectorMatrix(sentence);
Collection<String> lst_2 = wordVectors.wordsNearest(wordVector1, 10);

where sentence is a String. I also tried:

Collection<String> lst_2 = wordVectors.wordsNearest(sentence, 10);

The first yields a NullPointerException and the second an empty collection, where I believe it shouldn’t.

I see another signature of wordVectors#wordsNearest that accepts an “INDArray”, but I don’t know what that is, except that it is a high-performance data structure.

Also, following the example here, in the “Tried using the wordNearest with only positive vectors” snippet, I tried:

Collection<String> lst_2 = wordVectors.wordsNearest(Arrays.asList(sentence.split(" ")), 10);

With org.deeplearning4j version 0.9.1, this complains about the types: I pass one collection, but it seems another collection is also needed.

Any hints?

@bacloud14

“INDArray”

Please always make sure to understand the fundamentals. Stepping into machine learning without knowing the basics of ndarrays is always going to be fraught with rabbit holes.
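
Roughly speaking, an INDArray is ND4J’s n-dimensional array, the data structure every DL4J model consumes and produces; a word vector is itself an INDArray. The wordsNearest(INDArray, int) overload expects a single vector with the same dimensionality as the model. A minimal sketch, assuming wordVectors is a loaded model, every token is actually in its vocabulary, and getWordVectorsMean is available in your version:

// A single word vector is a 1 x layerSize INDArray.
INDArray catVector = wordVectors.getWordVectorMatrix("cat");

// For several words, average their vectors and look up neighbours of the mean.
INDArray mean = wordVectors.getWordVectorsMean(Arrays.asList("big", "cat"));
Collection<String> nearest = wordVectors.wordsNearest(mean, 10);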

Dl4j 0.9.x
Dl4j 0.9.x is over 2 years old and shouldn’t be used. Our quickstart covers the main way you should get started:
https://deeplearning4j.konduit.ai/getting-started/quickstart

We have examples that do this out of the box; our quickstart links directly to the latest examples.

Word2vec isn’t as amazing as you think it is. Take a step back and cover even the barest NLP basics before diving in and trying to use a model for something it’s not meant for :slight_smile:

One key concept is the n-gram: n-gram - Wikipedia
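
For intuition, here is roughly what an n-gram tokenizer produces for the phrase from your question (a sketch using DL4J’s NGramTokenizerFactory; the exact output may vary by version):

TokenizerFactory t = new NGramTokenizerFactory(new DefaultTokenizerFactory(), 1, 3);
List<String> tokens = t.create("a big cat").getTokens();
// roughly: [a, big, cat, a big, big cat, a big cat]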

Another is the vocabulary. All models, including word2vec, have a fixed set of possible words that they learned:
https://deeplearning4j.konduit.ai/language-processing/word2vec

You can’t just magic a word into existence (let alone multiple words).
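
That is usually where the NullPointerException / empty result comes from: querying a token the model never saw. A small sketch, assuming wordVectors is your loaded model:

// Only query tokens the model actually learned during training.
if (wordVectors.hasWord("tiger")) {
    Collection<String> nearest = wordVectors.wordsNearest("tiger", 10);
}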

The models in dl4j work on a single unigram, i.e. one word. You can train word2vec on bi- or trigrams if you want, but that would mean building a new model. If you want to do that, I can show you how to proceed.


Thank you @agibsonccc for your valuable reply. I admit that I am new and that the way I presented the problem was probably incoherent, like a message in a bottle. I am a generalist programmer with a recent interest in data science, and in this specific case I am looking for a quick solution for a small Android application I am making. Again, I know these algorithms are tricky.

Finally, after digging through the examples and the API, I was able to train on n-grams:

String filePath = new ClassPathResource("myfile.shuffled").getFile().getAbsolutePath();
SentenceIterator iter = new BasicLineIterator(filePath);   // one sentence per line
// emit unigrams, bigrams and trigrams as tokens
TokenizerFactory t = new NGramTokenizerFactory(new DefaultTokenizerFactory(), 1, 3);
t.setTokenPreProcessor(new CommonPreprocessor());          // lowercase and strip punctuation

and

Word2Vec vec = new Word2Vec.Builder()
                .minWordFrequency(10)   // ignore tokens seen fewer than 10 times
                .layerSize(64)          // embedding dimensionality
                .windowSize(5)          // context window size
                .iterate(iter)
                .tokenizerFactory(t)
                .build();
        LOGGER.info("Fitting Word2Vec model....");
        vec.fit();

It is giving fair results on my real-world prediction data.
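
In case it helps someone else: once bigrams and trigrams are entries in the vocabulary, a multi-word phrase can be queried directly, as long as it occurred at least minWordFrequency times in the training data (assuming the n-gram tokenizer joins tokens with a space):

// "big cat" is now a single vocabulary entry if it was frequent enough.
if (vec.hasWord("big cat")) {
    Collection<String> nearest = vec.wordsNearest("big cat", 10);
}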

Thanks again.

@bacloud14 great to hear! Good luck then!
