Pipeline between TfidfRecordReader and RecordReaderDataSetIterator

Hello everyone.

I need some help building a pipeline between TfidfRecordReader and RecordReaderDataSetIterator.

This is my code:

    int batchSize = 100;
    int seed = 123;

    File rootTrainingTxtFolder = new File("./TRAINING/TXT");
    String[] allowedFormats = new String[]{".txt"};
    FileSplit fileSplit = new FileSplit(rootTrainingTxtFolder, allowedFormats, new Random(seed));

    Configuration config = new Configuration();
    config.setBoolean(RecordReader.APPEND_LABEL, true);
    config.setInt(TextVectorizer.MIN_WORD_FREQUENCY, 1);

    TfidfRecordReader recordReader = new TfidfRecordReader();
    recordReader.initialize(config, fileSplit);

    int nbrLabel = recordReader.getLabels().size();

    DataSetIterator trainIter = new RecordReaderDataSetIterator.Builder(recordReader, batchSize)
            .classification(1, nbrLabel)
            .build();

    trainIter.next();

And I get this exception:

    Exception in thread "main" java.lang.IllegalStateException: Cannot put array: array should have leading dimension of 1 and equal rank to output array. Attempting to put array of shape [9418] into output array of shape [100]
        at org.nd4j.common.base.Preconditions.throwStateEx(Preconditions.java:641)
        at org.nd4j.common.base.Preconditions.checkState(Preconditions.java:340)
        at org.deeplearning4j.datasets.datavec.RecordReaderMultiDataSetIterator.putExample(RecordReaderMultiDataSetIterator.java:544)
        at org.deeplearning4j.datasets.datavec.RecordReaderMultiDataSetIterator.convertWritablesHelper(RecordReaderMultiDataSetIterator.java:516)
        at org.deeplearning4j.datasets.datavec.RecordReaderMultiDataSetIterator.convertWritables(RecordReaderMultiDataSetIterator.java:454)
        at org.deeplearning4j.datasets.datavec.RecordReaderMultiDataSetIterator.convertFeaturesOrLabels(RecordReaderMultiDataSetIterator.java:364)
        at org.deeplearning4j.datasets.datavec.RecordReaderMultiDataSetIterator.nextMultiDataSet(RecordReaderMultiDataSetIterator.java:327)
        at org.deeplearning4j.datasets.datavec.RecordReaderMultiDataSetIterator.next(RecordReaderMultiDataSetIterator.java:213)
        at org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator.next(RecordReaderDataSetIterator.java:378)
        at org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator.next(RecordReaderDataSetIterator.java:453)
        at org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator.next(RecordReaderDataSetIterator.java:85)

What is wrong? Do I need to use a WritableConverter?

Just a word of warning here: that record reader has been removed for the next release, one of the reasons being that it isn’t used often and therefore didn’t see a lot of maintenance.

In general, a TFIDF transform doesn’t really make sense in the context of deep learning. You get the drawbacks of both worlds here: a big, but usually sparse, input array (i.e. harder to train, requires more resources), and you still have the out-of-vocabulary problem.

What you are seeing is a side effect of a change made quite some time ago. Vectors used to be implicit matrices of shape [1, n]; with a change introduced, I think, two versions ago, they became proper rank-1 arrays. The TfidfRecordReader hasn’t been updated to accommodate this, which is why the iterator fails when it tries to copy the rank-1 [9418] feature vector into the batch it is assembling.

I’d suggest you try a different kind of input.
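
That said, if you need to keep this exact pipeline alive for now, one rough workaround is to skip RecordReaderDataSetIterator entirely and build the DataSet by hand, reshaping each TFIDF vector into a [1, n] row before stacking. This is only a sketch, assuming the reader still returns one NDArrayWritable followed by the appended label index; I haven’t run it against your data:

    import org.datavec.api.writable.NDArrayWritable;
    import org.datavec.api.writable.Writable;
    import org.nd4j.linalg.api.ndarray.INDArray;
    import org.nd4j.linalg.dataset.DataSet;
    import org.nd4j.linalg.factory.Nd4j;

    import java.util.ArrayList;
    import java.util.List;

    // Reuses recordReader and nbrLabel from your snippet above.
    List<INDArray> featureRows = new ArrayList<>();
    List<INDArray> labelRows = new ArrayList<>();

    while (recordReader.hasNext()) {
        List<Writable> record = recordReader.next();

        // Assumption: element 0 is the TFIDF vector, the last element is the label index.
        INDArray vector = ((NDArrayWritable) record.get(0)).get();
        int labelIdx = record.get(record.size() - 1).toInt();

        // Reshape the rank-1 [n] vector into a [1, n] row so the rows can be stacked.
        featureRows.add(vector.reshape(1, vector.length()));

        // One-hot encode the label.
        INDArray oneHot = Nd4j.zeros(1, nbrLabel);
        oneHot.putScalar(0, labelIdx, 1.0);
        labelRows.add(oneHot);
    }

    DataSet allData = new DataSet(
            Nd4j.vstack(featureRows.toArray(new INDArray[0])),
            Nd4j.vstack(labelRows.toArray(new INDArray[0])));

For anything beyond a small corpus you would want to batch this instead of stacking everything into a single DataSet.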

Thank you very much for this clear and quick answer.
I have a set of invoices that I need to classify by supplier, relying on the fact that invoices from the same supplier are similar to each other.
I am thinking of using “ParagraphVectors” with cosine similarity.
Is this the best choice, or are there other solutions?
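
Just to make the idea concrete, this is roughly what I have in mind. It is only a sketch based on the deeplearning4j-nlp ParagraphVectors builder; the folder layout (one sub-folder per supplier) and all hyperparameters are placeholders I made up:

    import org.deeplearning4j.models.paragraphvectors.ParagraphVectors;
    import org.deeplearning4j.text.documentiterator.FileLabelAwareIterator;
    import org.deeplearning4j.text.tokenization.tokenizer.preprocessor.CommonPreprocessor;
    import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
    import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;
    import org.nd4j.linalg.api.ndarray.INDArray;
    import org.nd4j.linalg.ops.transforms.Transforms;

    import java.io.File;

    // One sub-folder per supplier: ./TRAINING/TXT/<supplier>/<invoice>.txt
    FileLabelAwareIterator iterator = new FileLabelAwareIterator.Builder()
            .addSourceFolder(new File("./TRAINING/TXT"))
            .build();

    TokenizerFactory tokenizerFactory = new DefaultTokenizerFactory();
    tokenizerFactory.setTokenPreProcessor(new CommonPreprocessor());

    ParagraphVectors vectors = new ParagraphVectors.Builder()
            .iterate(iterator)
            .tokenizerFactory(tokenizerFactory)
            .layerSize(100)        // embedding size, just a guess
            .minWordFrequency(1)
            .epochs(20)
            .build();
    vectors.fit();

    // Compare an unseen invoice to a known one via cosine similarity.
    INDArray newInvoice = vectors.inferVector("raw text of an unseen invoice");
    INDArray knownInvoice = vectors.inferVector("raw text of an invoice from a known supplier");
    double similarity = Transforms.cosineSim(newInvoice, knownInvoice);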

Honestly, I think a deep learning based approach is most likely overkill for that task.

I bet you will have fewer issues and more success with simpler approaches. You’d still vectorize your text by applying TFIDF to it, but then a simple decision tree or naive Bayes classifier is likely going to work better for you.
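
For illustration, here is a minimal sketch of that route using Weka, which is just one common Java option and not something you have to use; the class names are from Weka 3.8 and the one-sub-folder-per-supplier layout is an assumption on my part:

    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayesMultinomial;
    import weka.core.Instances;
    import weka.core.converters.TextDirectoryLoader;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.StringToWordVector;

    import java.io.File;
    import java.util.Random;

    // Expects one sub-folder per supplier: ./TRAINING/TXT/<supplier>/<invoice>.txt
    TextDirectoryLoader loader = new TextDirectoryLoader();
    loader.setDirectory(new File("./TRAINING/TXT"));
    Instances raw = loader.getDataSet();

    // Turn the raw text attribute into TF-IDF weighted word features.
    StringToWordVector tfidf = new StringToWordVector();
    tfidf.setTFTransform(true);
    tfidf.setIDFTransform(true);
    tfidf.setLowerCaseTokens(true);
    tfidf.setInputFormat(raw);
    Instances vectorized = Filter.useFilter(raw, tfidf);

    // Multinomial naive Bayes, checked with 10-fold cross-validation.
    NaiveBayesMultinomial nb = new NaiveBayesMultinomial();
    Evaluation eval = new Evaluation(vectorized);
    eval.crossValidateModel(nb, vectorized, 10, new Random(123));
    System.out.println(eval.toSummaryString());

Swapping in a decision tree instead is a one-line change, and either way you end up with a model that is cheap to train and easy to inspect.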