BagOfWord questions

Hi,

I’m trying to replicate this TF example that simply creates a BagOfWord representation of a set of input sentences.

So far I’ve initialized a BagOfWordsVectorizer:

BagOfWordsVectorizer bagOfWords = new BagOfWordsVectorizer.Builder()
                    .setTokenizerFactory(tokenizerFactory)
                    .setIterator(sentencesIt)
                    .build();

but I’m not sure:

  • About the difference between buildVocab() and fit(), and whether either of the two is in fact needed to create the initial bag-of-words representation (does “build()” already take care of this?). This should be the equivalent of the fit_on_texts call in the example
  • And how I can then later transform another sentence using this same BagOfWords (the texts_to_sequences call in the example).

Thanks for your help and happy to be pointed to any bagofword example I could reuse (tried to find one but failed)

buildVocab and fit are in this case indeed exactly the same thing.

But build doesn’t take care of it there, as it is simply the last call to the BagOfWordsVectorizer.Builder to tell it to actually build the vectorizer itself.

You build the vocab, because you use your original data to specify what words it should even know about. As you are going to use the vectorizer to build a fixed-size vector, you can only have a fixed number of words in your vocabulary.

You use either transform or vectorize, depending on whether you have a label or not.
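To illustrate the idea, here is a plain-Java sketch of what “fit on the vocab, then transform into a fixed-size vector” amounts to. This is a conceptual illustration, not the DL4J implementation: the real vectorizer also handles tokenization via the TokenizerFactory, vocabulary size limits, and returns INDArrays rather than double arrays.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BowSketch {
    // word -> column index; fixed once "fitting" is done
    private final Map<String, Integer> vocab = new LinkedHashMap<>();

    // Equivalent of buildVocab()/fit(): learn which words the vectorizer knows about.
    public void fit(List<String> sentences) {
        for (String s : sentences) {
            for (String token : s.toLowerCase().split("\\s+")) {
                vocab.putIfAbsent(token, vocab.size());
            }
        }
    }

    // Equivalent of transform(): map a sentence to a fixed-size count vector.
    // Words never seen during fit are simply ignored.
    public double[] transform(String sentence) {
        double[] vec = new double[vocab.size()];
        for (String token : sentence.toLowerCase().split("\\s+")) {
            Integer idx = vocab.get(token);
            if (idx != null) {
                vec[idx]++;
            }
        }
        return vec;
    }

    public static void main(String[] args) {
        BowSketch bow = new BowSketch();
        bow.fit(List.of("hello world", "hello bot"));
        // vocab: hello=0, world=1, bot=2
        System.out.println(java.util.Arrays.toString(bow.transform("hello hello bot")));
        // prints [2.0, 0.0, 1.0]
    }
}
```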


Thanks for the hint. I have a follow-up question:

  • What’s the role of the labels in the creation of the BagOfWords? We obviously need to add the labels to the vectorized sentences at some point for training, but I wonder why they would be needed when creating the bag of words. If it’s just a practical way to then get a DataSet object ready with vectorize, then good, but I wonder if I’m missing something.

  • Also it seems that I cannot add a LabelsSource as part of the Builder so not sure the best way to add them.

That is exactly it. It is just a convenience method.

As the BagOfWordsVectorizer doesn’t really care about any labels, it doesn’t need to receive them.

Not sure I get it. If I want to use vectorize, I need to build the bagOfWords with the labelsSource (even if labels are just stored for future vectorize calls). But if I cannot add them as part of the Builder how can I add them? I don’t see any obvious setLabelsSource method.

I see. I’ve just checked the code, and yes, you are right, the Builder doesn’t provide a way for you to set the label source.

It may be an oversight - we often use Lombok’s @Builder annotation to create builders, so maybe it just didn’t get a setter because the person implementing that feature expected it to be created automatically.

I guess this particular functionality didn’t get used much, given that it hasn’t been touched in over 3 years and as far as I can tell you are the first one asking about it.

But if you want to change that, I guess a Pull Request will be appreciated.

Done. For now, I’ll just create a second bag of words for the labels and pass the transform of the sentence and the transform of the label to a DataSet object.

If your labels are bags of words too, then the vectorize functionality wouldn’t have worked for your case anyway.

As you can see, the label it produces is one-hot encoded.

To give more context, I’m trying to create a simple intent detection for a chatbot. So my labels are strings (the names of the chatbot intents to be detected) and my training set are strings as well (a set of example sentences for each label). Then, I want my network to predict the right intent for a new user utterance provided as String.

I’m now trying to fit my neural network model by giving as input a DataSet (the number of sentences is small so I don’t need to read them from a file).

Since the DataSet constructor expects (from what I understood) an INDArray for the data and a second one for the label, I assumed I could iterate over the training sentences and, for each training sentence “ts”, write something like:

myDataSet = new DataSet(myBagOfWordsFromTrainingSentences.transform(ts),
                    myBagOfWordsFromLabels.transform(labelCorrespondingToTS));

and merge the list of created DataSets into a single one to be passed on to myNeuralNetwork.fit.
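To check my understanding: conceptually, the merge step just stacks the per-sentence row vectors into one feature matrix and one label matrix. Here is a plain-Java sketch of that idea (not ND4J code; the actual merge works on INDArrays inside DataSet objects):

```java
import java.util.List;

public class MergeSketch {
    // Stack a list of equal-length row vectors into one 2-D matrix,
    // analogous to merging per-sentence DataSets into a single DataSet.
    public static double[][] stack(List<double[]> rows) {
        double[][] matrix = new double[rows.size()][];
        for (int i = 0; i < rows.size(); i++) {
            matrix[i] = rows.get(i).clone();
        }
        return matrix;
    }

    public static void main(String[] args) {
        double[][] features = stack(List.of(
                new double[]{2, 0, 1},   // transform of training sentence 1
                new double[]{0, 1, 1})); // transform of training sentence 2
        System.out.println(features.length + " x " + features[0].length);
        // prints 2 x 3
    }
}
```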

Since my labels are single-word strings, the result of BagOfWordsFromLabels should be a vector with a size equal to the number of intents, where each label would be represented as a vector of zeros with a single 1 in the column corresponding to the label word.

But looking at the vectorize method, I guess an easier way to do it is by calling FeatureUtil.toOutcomeVector, and my solution was just overkill.
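In plain Java, the one-hot encoding I described would amount to something like this (just a conceptual sketch: the real FeatureUtil.toOutcomeVector returns an INDArray, not a double array, and the intent names here are made up for illustration):

```java
import java.util.Arrays;
import java.util.List;

public class OneHotSketch {
    // Mirrors the idea of FeatureUtil.toOutcomeVector(index, numOutcomes):
    // a vector of zeros with a single 1 at the label's index.
    public static double[] toOutcomeVector(int index, int numOutcomes) {
        double[] vec = new double[numOutcomes];
        vec[index] = 1.0;
        return vec;
    }

    public static void main(String[] args) {
        List<String> intents = List.of("greet", "order", "bye"); // hypothetical intents
        // Label "order" -> index 1 of 3 intents.
        System.out.println(Arrays.toString(
                toOutcomeVector(intents.indexOf("order"), intents.size())));
        // prints [0.0, 1.0, 0.0]
    }
}
```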

Yes, for simple one-hot encoded outputs, this way is a lot easier.