I need to vectorize text for multi-label classification, and for that I thought about using Mulan (from what I read, DL4J does not do multi-label classification). Is it possible to do the vectorization using Word2Vec or ParagraphVectors? The required format would be something like:
doc_1, 0.01, 0.02, -0.12, …, x, y, z
doc_2, 0.03, -0.05, 0.14, …, x, y, z
doc_n, …
where x, y, z are the labels.
@neyanog Yes, you can: each document would be an embedding.
Could I have a code example of how to do this?
@neyanog just look at the examples: deeplearning4j-examples/dl4j-examples/src/main/java/org/deeplearning4j/examples/advanced/modelling/embeddingsfromcorpus/paragraphvectors at master · deeplearning4j/deeplearning4j-examples · GitHub
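Since the goal is one vector per document, note that Word2Vec produces per-word vectors, while ParagraphVectors produces per-document vectors, which matches the layout described in the first post. A minimal sketch along the lines of the linked example, assuming DL4J is on the classpath (the file path and layer size are placeholders, not values from this thread):

```java
import org.deeplearning4j.models.paragraphvectors.ParagraphVectors;
import org.deeplearning4j.text.documentiterator.LabelsSource;
import org.deeplearning4j.text.sentenceiterator.BasicLineIterator;
import org.deeplearning4j.text.sentenceiterator.SentenceIterator;
import org.deeplearning4j.text.tokenization.tokenizer.preprocessor.CommonPreprocessor;
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;
import org.nd4j.linalg.api.ndarray.INDArray;

// One document per line; each line gets an auto-generated label DOC_0, DOC_1, ...
SentenceIterator iter = new BasicLineIterator("myFilePath");
TokenizerFactory t = new DefaultTokenizerFactory();
t.setTokenPreProcessor(new CommonPreprocessor());

ParagraphVectors vec = new ParagraphVectors.Builder()
        .minWordFrequency(1)
        .layerSize(100)          // placeholder dimensionality
        .seed(42)
        .windowSize(5)
        .labelsSource(new LabelsSource("DOC_"))
        .iterate(iter)
        .tokenizerFactory(t)
        .build();
vec.fit();

// The embedding of the first document:
INDArray doc0 = vec.getWordVectorMatrix("DOC_0");
```

After `fit()`, each `DOC_n` label maps to one learned document vector, which is what you would export per row.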
Please, would it be something like this?
public static void main(String[] args) {
    String filePath = "myFilePath";
    FileDocumentIterator fileDocumentIterator = new FileDocumentIterator(filePath);
    TokenizerFactory tokenizerFactory = new DefaultTokenizerFactory();
    tokenizerFactory.setTokenPreProcessor(new CommonPreprocessor());

    Word2Vec vec = new Word2Vec.Builder()
            .minWordFrequency(1)
            .layerSize(2)
            .seed(42)
            .windowSize(5)
            .iterate(fileDocumentIterator)
            .tokenizerFactory(tokenizerFactory)
            //.epochs(10)
            .build();
    vec.fit();

    // The weight table is vocabSize x layerSize: each ROW is one word vector,
    // so iterate over rows rather than columns.
    WeightLookupTable weightLookupTable = vec.lookupTable();
    INDArray indArray = weightLookupTable.getWeights();
    long rows = indArray.shape()[0];
    List<double[]> vectorizedList = new ArrayList<>();
    for (int i = 0; i < rows; i++) {
        vectorizedList.add(indArray.getRow(i).toDoubleVector());
    }

    // Print the first extracted vector.
    double[] firstVector = vectorizedList.get(0);
    for (double value : firstVector) {
        System.out.print(value + " ");
    }
}
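If this is going into Mulan, the remaining step is appending the labels and writing each document out in the doc_id, values, labels layout from the first post. A self-contained sketch of just that formatting step, using only the standard library (the doc id, vector values, and label names here are made-up placeholders):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.StringJoiner;

public class MulanRowFormatter {

    // Builds one "doc_id, v1, v2, ..., label1, label2, ..." line.
    static String formatRow(String docId, double[] vector, List<String> labels) {
        StringJoiner row = new StringJoiner(", ");
        row.add(docId);
        for (double v : vector) {
            // Locale.ROOT keeps the decimal separator a '.' regardless of system locale.
            row.add(String.format(Locale.ROOT, "%.4f", v));
        }
        labels.forEach(row::add);
        return row.toString();
    }

    public static void main(String[] args) {
        double[] embedding = {0.01, 0.02, -0.12};
        String line = formatRow("doc_1", embedding, Arrays.asList("x", "y", "z"));
        System.out.println(line);
        // prints: doc_1, 0.0100, 0.0200, -0.1200, x, y, z
    }
}
```

In practice you would call `formatRow` once per document vector and write the lines to a file; Mulan's own ARFF format would need a different header, but the per-row value layout is the same idea.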