I need to vectorize text for multi-label classification, and for that I thought about using Mulan (from what I read, DL4J does not do multi-label classification). Is it possible to do the vectorization using Word2Vec or ParagraphVectors? The required format would be something like:
doc_1, 0.01, 0.02, -0.12, …, x, y, z
doc_2, 0.03, -0.05, 0.14, …, x, y, z
doc_n, …
where x, y, z are the labels.
@neyanog Yes, you can: each document would be an embedding.
Could I have a code example of how to do this?
@neyanog just look at the examples: deeplearning4j-examples/dl4j-examples/src/main/java/org/deeplearning4j/examples/advanced/modelling/embeddingsfromcorpus/paragraphvectors at master · deeplearning4j/deeplearning4j-examples · GitHub
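Since the goal is one vector per document, note that Word2Vec produces per-word vectors, while ParagraphVectors produces per-document vectors, which matches the layout described in the first post. A minimal sketch along the lines of the linked example, assuming DL4J is on the classpath (the file path and layer size are placeholders, not values from this thread):

```java
import org.deeplearning4j.models.paragraphvectors.ParagraphVectors;
import org.deeplearning4j.text.documentiterator.LabelsSource;
import org.deeplearning4j.text.sentenceiterator.BasicLineIterator;
import org.deeplearning4j.text.sentenceiterator.SentenceIterator;
import org.deeplearning4j.text.tokenization.tokenizer.preprocessor.CommonPreprocessor;
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;
import org.nd4j.linalg.api.ndarray.INDArray;

// One document per line; each line gets an auto-generated label DOC_0, DOC_1, ...
SentenceIterator iter = new BasicLineIterator("myFilePath");
TokenizerFactory t = new DefaultTokenizerFactory();
t.setTokenPreProcessor(new CommonPreprocessor());

ParagraphVectors vec = new ParagraphVectors.Builder()
        .minWordFrequency(1)
        .layerSize(100)          // placeholder dimensionality
        .seed(42)
        .windowSize(5)
        .labelsSource(new LabelsSource("DOC_"))
        .iterate(iter)
        .tokenizerFactory(t)
        .build();
vec.fit();

// The embedding of the first document:
INDArray doc0 = vec.getWordVectorMatrix("DOC_0");
```

After `fit()`, each `DOC_n` label maps to one learned document vector, which is what you would export per row.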
Please, would it be something like this?
public static void main(String[] args) {
    String filePath = "myFilePath";
    FileDocumentIterator fileDocumentIterator = new FileDocumentIterator(filePath);
    TokenizerFactory tokenizerFactory = new DefaultTokenizerFactory();
    tokenizerFactory.setTokenPreProcessor(new CommonPreprocessor());

    Word2Vec vec = new Word2Vec.Builder()
            .minWordFrequency(1)
            .layerSize(2)
            .seed(42)
            .windowSize(5)
            .iterate(fileDocumentIterator)
            .tokenizerFactory(tokenizerFactory)
            //.epochs(10)
            .build();
    vec.fit();

    // The weight table is vocabSize x layerSize: each ROW is one word vector,
    // so iterate over rows rather than columns.
    WeightLookupTable weightLookupTable = vec.lookupTable();
    INDArray indArray = weightLookupTable.getWeights();
    long rows = indArray.shape()[0];
    List<double[]> vectorizedList = new ArrayList<>();
    for (int i = 0; i < rows; i++) {
        vectorizedList.add(indArray.getRow(i).toDoubleVector());
    }

    // Print the first extracted vector.
    double[] firstVector = vectorizedList.get(0);
    for (double value : firstVector) {
        System.out.print(value + " ");
    }
}
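If this is going into Mulan, the remaining step is appending the labels and writing each document out in the doc_id, values, labels layout from the first post. A self-contained sketch of just that formatting step, using only the standard library (the doc id, vector values, and label names here are made-up placeholders):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.StringJoiner;

public class MulanRowFormatter {

    // Builds one "doc_id, v1, v2, ..., label1, label2, ..." line.
    static String formatRow(String docId, double[] vector, List<String> labels) {
        StringJoiner row = new StringJoiner(", ");
        row.add(docId);
        for (double v : vector) {
            // Locale.ROOT keeps the decimal separator a '.' regardless of system locale.
            row.add(String.format(Locale.ROOT, "%.4f", v));
        }
        labels.forEach(row::add);
        return row.toString();
    }

    public static void main(String[] args) {
        double[] embedding = {0.01, 0.02, -0.12};
        String line = formatRow("doc_1", embedding, Arrays.asList("x", "y", "z"));
        System.out.println(line);
        // prints: doc_1, 0.0100, 0.0200, -0.1200, x, y, z
    }
}
```

In practice you would call `formatRow` once per document vector and write the lines to a file; Mulan's own ARFF format would need a different header, but the per-row value layout is the same idea.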