Doc2Vec question

(Java) Hello. I am looking for a way to determine whether a text could have been written by a given author.
I found out about word2vec, but then someone pointed out that word2vec only shows whether one text is about the same thing as another. So I turned to Doc2Vec.

Is Doc2Vec suitable for evaluating a few A4 pages of text? As far as I understand, every time I want to compare some text, I need to get all the pages from the author, feed them to the model, and then ask it to compare them with the new text?
And is there any example of using the DL4J Doc2Vec with an existing word model? Or do I not need a pretrained word model for Doc2Vec (I intend to use Google’s News word2vec model)?

Can I combine all the texts from an author into one large document and then turn that into a vector? And then take the vector of any new document and compute its cosine similarity with the author’s existing vector?

The intended use is:

  1. For each existing author in the DB I create a model.
  2. Each author gets their own vector (all of their texts combined into one, labeled with AuthorName, and vectorized).
  3. After that I get some text, vectorize it, and compare this vector with every author’s vector by cosine similarity.
    Are there any existing examples for this kind of use? I found a lot for word2vec, but nothing in Java for doc2vec.

What is the best approach here?
If I am constantly adding new texts to authors, or adding new authors, keeping one giant model would mean retraining it after every added label (author) or text. But maybe I am wrong and it works differently.
The other choice is to compute the vector for each author separately (but I do not know how accurate the vector for a label would be without all the other authors).

There are many different ways that you can handle converting a document to a vector.

DL4J at the moment does not implement a specific model like doc2vec for it.

What you are looking to create is something that produces “author embeddings” when given a text, and we do not have anything pre-created for you, so you’ll have to do some research of your own on how actual models for that kind of problem are structured.

If you need more functionality than DL4J itself offers with its high-level MultiLayerNetwork or ComputationGraph APIs, you might find SameDiff useful once you get to actually implementing your model.

But what about Paragraph Vectors? Isn’t that the part I need?

Paragraph Vectors works better for creating a representation of the content. You certainly can still try to use it that way, but you will most likely have to retrain it every time you add a new author. Also note that you can’t use plain word embeddings (e.g. Google’s word2vec) for the word vectors that it requires, as they don’t provide the information that paragraph vectors need.

So I cannot build a model with DL4J tools that is trained to label content, where the labels are authors?

And yes, I was thinking about it, but with any solution that vectorizes the document rather than the words, I would need to retrain it each time I add a new text to an author.

So for now the best solution for me seems to be a word2vec average of all the words in a text, with tf-idf as a second statistic.

You can do that, but this is exactly the case where I say that you’ll have to retrain as you add new authors.

That way you can at least establish a base line for other ideas that you try. But here are a few problems that this approach has anyway:

  • Taking just the average does not retain word order - the same is also true for tf-idf.
  • As words alone are your primary input, your vectors will be heavily biased by the actual content of the document: different authors writing about the same topic will produce very similar vectors.
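
To make the first point concrete, here is a minimal sketch of the averaged-word-vector baseline in plain Java. The tiny embedding table and all names (AverageVectorBaseline, toyEmbeddings) are made up for illustration; in a real setup the vectors would come from a trained word2vec model. It shows that two texts with the same words in a different order end up with the same averaged vector.

```java
import java.util.HashMap;
import java.util.Map;

// Baseline sketch: average the word vectors of a text, compare texts by cosine.
// The embedding table below is a toy stand-in for a trained word2vec model.
class AverageVectorBaseline {

    // Averages the vectors of all known words; words without a vector are skipped.
    static double[] average(String text, Map<String, double[]> embeddings, int dim) {
        double[] sum = new double[dim];
        int count = 0;
        for (String token : text.toLowerCase().split("\\s+")) {
            double[] v = embeddings.get(token);
            if (v == null) {
                continue; // out-of-vocabulary word
            }
            for (int i = 0; i < dim; i++) {
                sum[i] += v[i];
            }
            count++;
        }
        if (count > 0) {
            for (int i = 0; i < dim; i++) {
                sum[i] /= count;
            }
        }
        return sum;
    }

    // Cosine similarity; returns 0.0 if either vector is all zeros.
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Hypothetical 2-dimensional embeddings, chosen only for the demonstration.
    static Map<String, double[]> toyEmbeddings() {
        Map<String, double[]> emb = new HashMap<>();
        emb.put("dog", new double[] {1.0, 0.0});
        emb.put("bites", new double[] {0.5, 0.5});
        emb.put("man", new double[] {0.0, 1.0});
        return emb;
    }

    public static void main(String[] args) {
        Map<String, double[]> emb = toyEmbeddings();
        double[] a = average("dog bites man", emb, 2);
        double[] b = average("man bites dog", emb, 2);
        // Same words, different order: the averaged vectors are identical.
        System.out.println("cosine = " + cosine(a, b));
    }
}
```

The two sentences mean different things, yet the baseline cannot tell them apart, so any stylistic signal that lives in word order is lost.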

So can you recommend any library for document vectorization? I did not find any approach other than summing all the word vectors, or computing the cosine over tf-idf with org.apache.commons.text.similarity.CosineSimilarity.

Document vectorization itself isn’t likely to really help you, as those methods are targeted towards finding a vector that represents the contents of a document well - often in a way that will make your task actually harder to solve.

I guess the closest thing to your task (author detection) is plagiarism detection. Try taking a look at the research on that topic.

I’d suggest that you first get a baseline using the simplest approach you can think of, and only then try more elaborate techniques like deep learning. Going with just a gut feeling, I’d say that TF-IDF with the IDF term being author dependent may work - at least if your authors tend to use words with different frequencies. If they have some key phrases that they use often, you might want to extend this to an n-gram model.
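
A rough sketch of what such a baseline could look like in plain Java follows. It rests on one reading of "author dependent IDF" that is an assumption here: all texts of an author are treated as a single document, and the document frequency of a term is the number of authors who use it. The class name, the smoothed IDF formula, and the sample texts are illustrative only. Word n-grams up to maxN are used as terms so that key phrases count as well.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Sketch of TF-IDF where the "documents" for the IDF term are whole authors.
class AuthorTfIdf {

    // Word n-grams for n = 1..maxN, lower-cased, split on non-word characters.
    static List<String> ngrams(String text, int maxN) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("\\W+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        List<String> grams = new ArrayList<>();
        for (int n = 1; n <= maxN; n++) {
            for (int i = 0; i + n <= tokens.size(); i++) {
                grams.add(String.join(" ", tokens.subList(i, i + n)));
            }
        }
        return grams;
    }

    // Relative frequency of each n-gram in the text.
    static Map<String, Double> termFrequencies(String text, int maxN) {
        List<String> grams = ngrams(text, maxN);
        Map<String, Double> tf = new HashMap<>();
        for (String g : grams) tf.merge(g, 1.0, Double::sum);
        for (Map.Entry<String, Double> e : tf.entrySet()) e.setValue(e.getValue() / grams.size());
        return tf;
    }

    // Smoothed IDF over authors: ln((1 + authors) / (1 + authorsUsingTerm)) + 1.
    static Map<String, Double> authorIdf(Map<String, String> authorTexts, int maxN) {
        Map<String, Integer> df = new HashMap<>();
        for (String text : authorTexts.values()) {
            for (String g : new HashSet<>(ngrams(text, maxN))) df.merge(g, 1, Integer::sum);
        }
        Map<String, Double> idf = new HashMap<>();
        int authors = authorTexts.size();
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            idf.put(e.getKey(), Math.log((1.0 + authors) / (1.0 + e.getValue())) + 1.0);
        }
        return idf;
    }

    // TF * IDF; terms that no known author uses are dropped.
    static Map<String, Double> weigh(Map<String, Double> tf, Map<String, Double> idf) {
        Map<String, Double> out = new HashMap<>();
        for (Map.Entry<String, Double> e : tf.entrySet()) {
            Double w = idf.get(e.getKey());
            if (w != null) out.put(e.getKey(), e.getValue() * w);
        }
        return out;
    }

    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, na = 0.0, nb = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            na += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) dot += e.getValue() * other;
        }
        for (double v : b.values()) nb += v * v;
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Returns the author whose weighted vector is closest to the text, or null if there are none.
    static String closestAuthor(String text, Map<String, String> authorTexts, int maxN) {
        Map<String, Double> idf = authorIdf(authorTexts, maxN);
        Map<String, Double> query = weigh(termFrequencies(text, maxN), idf);
        String best = null;
        double bestScore = -1.0;
        for (Map.Entry<String, String> e : authorTexts.entrySet()) {
            double score = cosine(query, weigh(termFrequencies(e.getValue(), maxN), idf));
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, String> authors = new HashMap<>();
        authors.put("alice", "to be honest the weather is nice to be honest");
        authors.put("bob", "needless to say the weather is nice needless to say");
        System.out.println(closestAuthor("to be honest the food is good", authors, 2));
    }
}
```

Note that in this sketch the IDF table still has to be recomputed whenever an author is added or gets new text, but that is a cheap counting pass rather than a model retraining.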

OK, can you help with the pom.xml dependency? Maybe it is obvious to someone, but I have a hard time finding it. The class is in org.deeplearning4j.models.word2vec, but I do not see which dependency provides it.

What does getWordVector do if it does not find a word? The manual seems to say nothing about it.

It is in the deeplearning4j-nlp artifact (groupId org.deeplearning4j).

It depends on the configuration. If it is configured to use an “UNK” token, then it will return the value for that. If it isn’t, you will get null.
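
Since a null is possible, it is worth guarding every lookup. The sketch below does not use DL4J at all: the Function<String, double[]> parameter is only a stand-in for a call such as getWordVector, and the class and method names are hypothetical. It collects the words for which the lookup returned null, which is also a quick way to see how much of a text your word model actually covers.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Null-safe handling of word vector lookups.
// "lookup" is a stand-in for a model call such as getWordVector;
// it is assumed to return null for words it does not know.
class SafeLookup {

    // Returns the tokens for which no vector was found.
    static List<String> missingWords(String[] tokens, Function<String, double[]> lookup) {
        List<String> missing = new ArrayList<>();
        for (String token : tokens) {
            if (lookup.apply(token) == null) {
                missing.add(token);
            }
        }
        return missing;
    }

    // Returns only the vectors that exist, in token order.
    static List<double[]> knownVectors(String[] tokens, Function<String, double[]> lookup) {
        List<double[]> vectors = new ArrayList<>();
        for (String token : tokens) {
            double[] v = lookup.apply(token);
            if (v != null) {
                vectors.add(v);
            }
        }
        return vectors;
    }

    public static void main(String[] args) {
        // Toy vocabulary used in place of a real model.
        Map<String, double[]> vocab = new HashMap<>();
        vocab.put("hello", new double[] {0.1, 0.2});
        vocab.put("world", new double[] {0.3, 0.4});

        String[] tokens = {"hello", "qwertyuiop", "world"};
        System.out.println("missing: " + missingWords(tokens, vocab::get));
        System.out.println("known vectors: " + knownVectors(tokens, vocab::get).size());
    }
}
```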