I am using paragraph vectors to label a dataset and would like to implement a multilayer network. Does anyone know how to do it?
@werty4777 could you clarify what you mean? It’s kind of hard to understand from the context here. You mention paragraph vectors (which itself is a classifier) and you want to build a separate document classifier based on…what?
Do you want to save the labels from paragraph vectors to a dataset and create an MLN based on that?
Beyond that…it really depends on how you want to vectorize the documents. Do you want to use bag of words/TFIDF? More complex features? What exactly do you expect to go into the MLN?
Sorry if I’m asking a bunch of questions that might sound complicated, but there’s really no single answer for this. As with most things in ML, you have to be aware of a few different approaches and then see what actually works in practice.
I would personally be concerned about this approach by itself, as there are no guarantees about the accuracy of paragraph vectors, and you should try to verify the results.
@agibsonccc Well, let me explain. What I’m trying to build is a document classifier. I have documents such as invoices, purchase orders, etc.
What I do is extract the information and assign a label to each dataset, and that way I feed ParagraphVectors using SimpleLabelAwareIterator, and I also use BoW.
But I would like to implement a multilayer network.
Sorry if I’m not explaining myself well, I’m new to this.
@werty4777 then you have 2 different problems:
- Labeling your dataset. This can be done in different ways, but for a simple bag of words approach, text files with the folder name as the label are pretty easy to set up.
- You actually just have 2 different kinds of models you’re trying to build. Paragraph vectors and the multilayer network are 2 different models. For the second one, try a simple TFIDF or bag of words approach first and see how well that does. Ideally, the simpler the network the better.
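To make the difference between bag of words and TFIDF concrete, here is a minimal plain-Java sketch. It is not DL4J code and not how DL4J's vectorizers are implemented: the class name `TfidfSketch`, the whitespace tokenizer, and the `count * ln(N / df)` weighting are all assumptions chosen for illustration (it is one common textbook variant of TFIDF).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Plain-Java illustration only; hypothetical class, not part of DL4J.
class TfidfSketch {

    // Lowercase and split on whitespace (a deliberately naive tokenizer).
    static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase().trim().split("\\s+"));
    }

    // Sorted vocabulary over all documents, so each word gets a stable column index.
    static List<String> vocabulary(List<String> docs) {
        Set<String> vocab = new TreeSet<>();
        for (String doc : docs) {
            vocab.addAll(tokenize(doc));
        }
        return new ArrayList<>(vocab);
    }

    // Bag of words: raw count of each vocabulary word in one document.
    static double[] bagOfWords(String doc, List<String> vocab) {
        double[] counts = new double[vocab.size()];
        for (String token : tokenize(doc)) {
            int idx = vocab.indexOf(token);
            if (idx >= 0) {
                counts[idx]++;
            }
        }
        return counts;
    }

    // TFIDF (assumed variant): count * ln(N / documentFrequency).
    // Words that occur in every document get weight 0.
    static double[] tfidf(String doc, List<String> docs, List<String> vocab) {
        double[] weights = bagOfWords(doc, vocab);
        for (int i = 0; i < vocab.size(); i++) {
            int df = 0;
            for (String d : docs) {
                if (tokenize(d).contains(vocab.get(i))) {
                    df++;
                }
            }
            weights[i] = df == 0 ? 0.0 : weights[i] * Math.log((double) docs.size() / df);
        }
        return weights;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("it barks like a dog", "it meows like a cat");
        List<String> vocab = vocabulary(docs);
        System.out.println(vocab);
        System.out.println(Arrays.toString(bagOfWords(docs.get(1), vocab)));
        System.out.println(Arrays.toString(tfidf(docs.get(1), docs, vocab)));
    }
}
```

With these two toy sentences, words shared by both documents ("it", "like", "a") end up with a TFIDF weight of zero, while "meows" and "cat" keep a positive weight. Plain counts would give all five words the same value of 1, which is why TFIDF tends to be the more useful input for a classifier.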
Don’t try to mix the 2. Just treat them as 2 separate models you are trying to build. Feel free to post more questions if you have something specific to set up.
Either way, first just focus on getting your dataset labeled and nothing else. Then you can try different models.
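For the "folder as the label" layout mentioned above, a rough plain-Java sketch could look like the following. It assumes a root directory whose immediate subfolders are the labels (for example `invoice/` and `purchase_order/`, which are made-up names) with one text file per document. `FolderLabels` and `LabeledText` are hypothetical helpers, not DL4J classes.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: reads root/<label>/<file> into (label, content) pairs.
class FolderLabels {

    // Simple holder for one labeled document.
    static final class LabeledText {
        final String label;
        final String content;

        LabeledText(String label, String content) {
            this.label = label;
            this.content = content;
        }
    }

    // Lists the entries of a directory in sorted order and closes the stream.
    static List<Path> sortedChildren(Path dir) throws IOException {
        try (Stream<Path> children = Files.list(dir)) {
            return children.sorted().collect(Collectors.toList());
        }
    }

    // Each immediate subfolder of root is a label; each regular file in it is one document.
    static List<LabeledText> load(Path root) throws IOException {
        List<LabeledText> out = new ArrayList<>();
        for (Path labelDir : sortedChildren(root)) {
            if (!Files.isDirectory(labelDir)) {
                continue; // ignore stray files next to the label folders
            }
            String label = labelDir.getFileName().toString();
            for (Path file : sortedChildren(labelDir)) {
                if (!Files.isRegularFile(file)) {
                    continue; // nested folders are not followed in this sketch
                }
                String content = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
                out.add(new LabeledText(label, content));
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        // Pass the dataset root as the first argument.
        for (LabeledText doc : load(Paths.get(args[0]))) {
            System.out.println(doc.label + ": " + doc.content.length() + " chars");
        }
    }
}
```

Each `LabeledText` could then be copied into a `LabelledDocument` via `addLabel` and `setContent` before handing the list to `SimpleLabelAwareIterator`.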
@agibsonccc
many thanks for your help <3
Sorry for so many questions, but I am very interested in learning. I found a class called BagOfWordsVectorizer. Could you explain to me what it is for and whether it could be used with a multilayer network?
@werty4777 I would recommend TFIDF over BagOfWords, since bag of words is just word counts, but feel free to use it as a starting point. Your dataset could be something like this:
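import java.util.ArrayList;
import java.util.List;
import org.deeplearning4j.bagofwords.vectorizer.TfidfVectorizer;
import org.deeplearning4j.text.documentiterator.LabelAwareIterator;
import org.deeplearning4j.text.documentiterator.LabelledDocument;
import org.deeplearning4j.text.documentiterator.SimpleLabelAwareIterator;
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;
import org.nd4j.linalg.dataset.DataSet;

// Two toy documents, each with a single label.
LabelledDocument doc1 = new LabelledDocument();
doc1.addLabel("dog");
doc1.setContent("it barks like a dog");
LabelledDocument doc2 = new LabelledDocument();
doc2.addLabel("cat");
doc2.setContent("it meows like a cat");

List<LabelledDocument> docs = new ArrayList<>(2);
docs.add(doc1);
docs.add(doc2);

// The iterator feeds the documents and their labels to the vectorizer.
LabelAwareIterator iterator = new SimpleLabelAwareIterator(docs);
TokenizerFactory tokenizerFactory = new DefaultTokenizerFactory();

// Keep every word (min frequency 1, no stop words) for this tiny example.
TfidfVectorizer vectorizer = new TfidfVectorizer.Builder()
        .setMinWordFrequency(1)
        .setStopWords(new ArrayList<String>())
        .setTokenizerFactory(tokenizerFactory)
        .setIterator(iterator)
        .allowParallelTokenization(false)
        .build();

// Build the vocabulary and document frequencies from the iterator.
vectorizer.fit();

// Turn a piece of text plus its label into a DataSet (features and label vector).
DataSet dataset = vectorizer.vectorize("it meows like a cat", "cat");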
Then that would be your dataset for training. Of course there are other considerations like train/test split and the like, so take it one step at a time. Once you know how to convert data to a labeled dataset for training, then you can try building a model.
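As a rough idea of what a train/test split means in practice, here is a minimal plain-Java sketch that shuffles a list of examples with a fixed seed and cuts it at a given fraction. `TrainTestSplit` is a hypothetical helper written for this post, and the 80/20 ratio in the usage is only an example. ND4J's `DataSet` also has a `splitTestAndTrain` method that covers the same need once the data is vectorized.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of DL4J.
class TrainTestSplit {

    // Shuffles a copy of the items with a fixed seed, then cuts it at trainFraction.
    // Returns a two-element list: index 0 is the training part, index 1 the test part.
    static <T> List<List<T>> split(List<T> items, double trainFraction, long seed) {
        if (trainFraction < 0.0 || trainFraction > 1.0) {
            throw new IllegalArgumentException("trainFraction must be between 0 and 1");
        }
        List<T> shuffled = new ArrayList<>(items);
        Collections.shuffle(shuffled, new Random(seed)); // fixed seed keeps the split reproducible
        int cut = (int) Math.round(shuffled.size() * trainFraction);
        List<List<T>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(shuffled.subList(0, cut)));
        parts.add(new ArrayList<>(shuffled.subList(cut, shuffled.size())));
        return parts;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9");
        List<List<String>> parts = split(docs, 0.8, 42L);
        System.out.println("train size: " + parts.get(0).size());
        System.out.println("test size: " + parts.get(1).size());
    }
}
```

Shuffling before cutting matters when the documents are stored grouped by label, otherwise the test part could end up containing only one class.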
@agibsonccc thank you very much for your help <3 It really helped me understand a bit more about how to use dl4j.