I’m using “org.deeplearning4j.models.paragraphvectors.ParagraphVectors” to transform my text data into embeddings, ending up with an “INDArray” (256-dim vector) representation for each of my > 500k original text snippets. Let’s call this the “library”.

Now, for the task “find the X best matching entries in the library for a new given snippet”, the suggested approach is to compute the embedding for the new snippet and find the closest embedding vectors (by cosine distance) in the existing “library”.
Is there any smarter way than scanning through the whole “library”, computing the cosine distance and keeping the best X?

The use case is having a large set of CV-like data (think e.g. LinkedIn snippets) and, on the other side, job/project description snippets.
We found that paragraph vectors of certain sections of a job offering and paragraph vectors of people’s CVs are “close/similar” (by cosine distance) in exactly those cases where, in reality, the people are roughly good candidates for the position.
So we use this for several matching features:

have a new position description … find the top matching people from the whole dataset

have a new person joining … find some initial potential interesting open jobs for that new guy

have a given job description … find “similar” jobs

So IMHO it’s not a standard classification task, but a more open nearest-neighbor search with a given new text snippet that comes either from a new person or from a new job offering.

I’m currently using “inferVector” for the new text snippet, and then the task is to efficiently find the closest vectors among the known set of several 100k other vectors.

@jdo interesting. So the way the embedding works with “inferVector” is repeated sampling, where it does a feed-forward pass over each sentence. That process starts from a randomly initialized vector.

You want to find the most similar vectors relative to the final vector from that process? Is that the idea?

If I understand you correctly … yes. We initially store the output of “inferVector” for the known > 100k text samples, and when we get a “similarity request” for a new text (a new job, or a new person with a relevant CV snippet), we also run “inferVector” and then match that output (the 255-dim vector) against all the other (stored) vectors.
Is there a better way than scanning all of them in sequence? k-d trees seem to perform about the same as a linear scan in this high-dimensional application.

It depends on what you really want. 100k vectors is closer to zero than it is to 1M vectors, and even 1M vectors isn’t that much.

In a high-dimensional space, if you are going for 100% accuracy, then brute force is likely to be a lot faster than more sophisticated search-tree algorithms.
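For reference, such a brute-force top-X scan is only a few lines. Below is an illustrative sketch using plain Java double[] arrays rather than ND4J INDArrays; the names “cosine” and “topX” are made up for this example, not ND4J API (ND4J itself has cosine-similarity helpers in its Transforms class, if I remember correctly).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class BruteForceScan {

    // Plain cosine similarity between two vectors of equal length.
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    /** Returns the indices of the x library vectors most similar to the query. */
    static List<Integer> topX(double[][] library, double[] query, int x) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < library.length; i++) indices.add(i);
        // Sort all indices by descending similarity, then keep the first x.
        indices.sort(Comparator.comparingDouble((Integer i) -> -cosine(library[i], query)));
        return indices.subList(0, Math.min(x, indices.size()));
    }

    public static void main(String[] args) {
        double[][] library = {
            {1, 0, 0},     // index 0: nearly parallel to the query
            {0.9, 0.1, 0}, // index 1: also close
            {0, 1, 0},     // index 2: nearly orthogonal
        };
        double[] query = {1, 0.05, 0};
        System.out.println(topX(library, query, 2)); // prints [0, 1]
    }
}
```

Sorting the whole index list is O(n log n); for large libraries a bounded priority queue of size X would give O(n log X), but for a few 100k vectors either way is fast.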

Typically, you’ll be interested in the cosine similarity between two vectors. That has a neat property:

cos_sim(a, b) = (a · b) / (‖a‖ ‖b‖)

That means that when the vectors are normalized (‖a‖ = ‖b‖ = 1), the cosine similarity is equal to the dot product. And the dot product between a query vector and an entire database of vectors can be calculated as a single matrix multiplication. And modern GPUs are really good at that.
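As a tiny numeric check of that normalization trick (plain Java double[] arrays, illustrative only, not ND4J code): after dividing each vector by its L2 norm once, a plain dot product gives the same number as the full cosine-similarity formula.

```java
public class NormalizedDot {

    // L2 norm of a vector.
    static double norm(double[] v) {
        double s = 0;
        for (double x : v) s += x * x;
        return Math.sqrt(s);
    }

    // Scales a vector to unit length.
    static double[] normalize(double[] v) {
        double n = norm(v);
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) out[i] = v[i] / n;
        return out;
    }

    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    public static void main(String[] args) {
        double[] a = {3, 4, 0}, b = {1, 2, 2};
        double cosine = dot(a, b) / (norm(a) * norm(b));          // full definition
        double dotOfNormalized = dot(normalize(a), normalize(b)); // shortcut
        System.out.println(cosine);
        System.out.println(dotOfNormalized); // same value, up to floating-point rounding
    }
}
```

The practical consequence: normalize all library vectors once at insertion time, normalize each query once, and the per-comparison work drops to a bare dot product.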

I’ve also looked into using a tree structure (a VP-tree) that is supposed to work better than k-d trees for higher-dimensional lookups. But it turned out that it could only really reduce the search space if you are interested in only the top few (< 5) results. For the top result, it had to check less than 50% of the search space. For the top 20 results, it had to check 99% of the search space, or accept less than 100% accuracy. And because that is highly branching code, even the top-1 case was slower than the brute-force version.

If you must have something with a better theoretical speed, you can look into implementing a Hierarchical Navigable Small World (HNSW) graph. But I think it isn’t worth the effort for just 100k vectors.

Accuracy doesn’t need to be 100%. An accuracy loss in the ordering is easy to tolerate (e.g. the 5th best match being reported as the 8th best), but if the best 2 matches could be missing entirely, that’s a bigger disadvantage. We are usually interested in the top 5 to top 20 results, depending on the use case.

As for the volume, in some cases it’s < 100k entries, in others between 200k and 500k, but still under a million. I’m currently using 255-dim embeddings and have not tested whether there would be a real quality drop when using only 128-dim vectors.

This I don’t fully get: say I now have 100k precomputed “org.nd4j.linalg.api.ndarray.INDArray” vectors as my database and a new “query” (a new text, transformed into a new INDArray vector). How could that search be batched up into a few, or even one single, matrix multiplication?

Think of your database as a matrix (or even keep it stored as a matrix): every single vector is simply a row (or column) of that matrix. Then you multiply the matrix with your query. The entries of the result are the pairwise similarities between every entry in your database and your query (and with several queries stacked into a matrix, you get all query/database pairs at once).
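A minimal sketch of the “database as matrix” idea, again with plain Java double[] arrays (the method name “similarities” is made up for illustration, not an ND4J API). The rows of “db” are the pre-normalized library vectors; multiplying by a normalized query yields every cosine similarity in one matrix-vector product. With ND4J, the same thing would be one mmul call on a (rows × dims) INDArray, which is exactly the operation GPUs accelerate.

```java
public class MatrixScan {

    /** similarities[i] = dot(db[i], query), which equals cosine similarity
     *  when both the rows of db and the query are unit-length. */
    static double[] similarities(double[][] db, double[] query) {
        double[] sims = new double[db.length];
        for (int row = 0; row < db.length; row++) {
            double s = 0;
            for (int j = 0; j < query.length; j++) s += db[row][j] * query[j];
            sims[row] = s;
        }
        return sims; // one similarity score per library entry
    }

    public static void main(String[] args) {
        // Already-normalized 2-dim library vectors (all unit length).
        double[][] db = {
            {1, 0},
            {0, 1},
            {Math.sqrt(0.5), Math.sqrt(0.5)},
        };
        double[] query = {1, 0}; // normalized query
        System.out.println(java.util.Arrays.toString(similarities(db, query)));
        // prints [1.0, 0.0, 0.7071067811865476]
    }
}
```

From the resulting similarity array, picking the top-X indices is the same cheap selection step as in the brute-force scan; the win is that the bulk of the arithmetic becomes one dense matrix multiplication instead of 100k separate vector operations.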