Making an iterator over a List

Hi everyone,

I’m pretty new to dl4j and just getting started. I want to build a simple example where the input is a Map<Double, Double>: the first Double is the feature/input (100 random numbers between 0 and 100), and the second Double is the target (1 if the input is > 50, 0 otherwise).
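
For context, generating that kind of data in plain Java could look something like this (a sketch of my own; the class and method names are made up, not anything from dl4j):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;

// Sketch of the data described above: n random inputs in [0, 100),
// labelled 1.0 if the input is > 50, else 0.0. Note that using the feature
// as the map key means duplicate inputs would silently collapse into one
// entry, which is why the loop checks the map size rather than counting.
public class ToyData {
    public static Map<Double, Double> generate(int n, long seed) {
        Random rng = new Random(seed);
        Map<Double, Double> data = new LinkedHashMap<>();
        while (data.size() < n) {
            double x = rng.nextDouble() * 100.0;   // feature in [0, 100)
            data.put(x, x > 50.0 ? 1.0 : 0.0);     // binary target
        }
        return data;
    }

    public static void main(String[] args) {
        Map<Double, Double> data = generate(100, 42L);
        System.out.println(data.size()); // prints 100
    }
}
```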

But I can’t seem to produce an iterator over the data. Here’s what I’ve tried:

    public static DataSetIterator fromEntriesToDataSet(Map<Double, Double> entries) {
        final Double[] boxedDoubles = entries.keySet().toArray(new Double[entries.size()]);
        final double[] unboxedDoubles = Stream.of(boxedDoubles).mapToDouble(Double::doubleValue).toArray();

        INDArray inputNDArray = Nd4j.create(unboxedDoubles);

        final Double[] bools = entries.values().toArray(new Double[entries.size()]);
        INDArray outPut = Nd4j.create(Arrays.asList(bools));

        DataSet dataSet = new DataSet(inputNDArray, outPut);
        List<DataSet> listDs = dataSet.asList(); //Cannot convert to list: feature set rank must be in range 2 to 5 inclusive. Got shape: [100]
        return new ListDataSetIterator<>(listDs, entries.size());
    }

As you can see from the comment, it fails to convert the DataSet to a list, saying the features have the wrong rank. What am I doing wrong?
Of course, if there’s an easier way to get an iterator, I’m all ears too. I just saw from the examples that they usually put the data into INDArrays, and that these need arrays, so I figured I had to convert the data that way.

Looks like I found the mistake:
I made a new attempt, this time with 2 features. For whatever reason, this satisfied the program (why can’t we use 1-dimensional data?). I then kept getting the error message, but this time it complained about the rank/dimension of the labels rather than the features. The solution was to pass the shape as a parameter when creating the INDArray. Here’s the new method:

    public static DataSetIterator fromEntriesToDataSet(Map<Double[], Double> entries) {
        final Double[][] boxedDoubles = entries.keySet().toArray(new Double[entries.size()][numFeatures]);
        double[][] unboxedDoubles = Utilities.unbox2DArrayOfDoubles(boxedDoubles, numFeatures);

        INDArray inputNDArray = Nd4j.create(unboxedDoubles);

        final Double[] bools = entries.values().toArray(new Double[entries.size()]);
        INDArray outPut = Nd4j.create(Utilities.unbox1DArrayOfDoubles(bools), entries.size(), 1);

        DataSet dataSet = new DataSet(inputNDArray, outPut);
        List<DataSet> listDs = dataSet.asList();
        return new ListDataSetIterator<>(listDs, entries.size());
    }
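
The `Utilities.unbox1DArrayOfDoubles` / `Utilities.unbox2DArrayOfDoubles` helpers aren’t shown in the post, so for completeness here is one plausible implementation of them (my own sketch, not the author’s actual code):

```java
// Hypothetical implementations of the Utilities helpers referenced above.
// They simply convert boxed Double arrays into primitive double arrays,
// which is what Nd4j.create expects.
public class Utilities {
    public static double[] unbox1DArrayOfDoubles(Double[] boxed) {
        double[] out = new double[boxed.length];
        for (int i = 0; i < boxed.length; i++) {
            out[i] = boxed[i];   // auto-unboxing; throws NPE on null entries
        }
        return out;
    }

    public static double[][] unbox2DArrayOfDoubles(Double[][] boxed, int numFeatures) {
        double[][] out = new double[boxed.length][numFeatures];
        for (int i = 0; i < boxed.length; i++) {
            for (int j = 0; j < numFeatures; j++) {
                out[i][j] = boxed[i][j];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] m = unbox2DArrayOfDoubles(new Double[][]{{1.0, 2.0}, {3.0, 4.0}}, 2);
        System.out.println(m[1][0]); // prints 3.0
    }
}
```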

@MJL Neural networks are about minibatches. Depending on the dataset, the data is usually shaped as [minibatch size, number of columns].

If you work with time series or images, the data will have more than 2 dimensions.
What you usually do with this iterator is have it return one instance at a time, wrapped as a DataSet.

A “DataSet” in dl4j is basically an input example + a set of labels.

I’d go so far as to say that the DataSet type is misnamed; it should rather be called MiniBatch or something like that, since in most cases it only ever holds a subset of the actual data set.

Anyway, for your situation, i.e. where you can effectively put the entire dataset into a single minibatch, the easiest solution is to not create a DataSetIterator at all, and instead create a single DataSet object.

The DataSet object holds two tensors (and optionally masks for those tensors, but those aren’t necessary here). In your particular case the shape of those tensors should look like this:

  • Features: [numExamples, numFeatures], i.e. [100, 1] for the data in your first post, and [100, 2] for the data in your second post.
  • Labels: [numExamples, numLabels], i.e. [100, 1] in your case (one could also model your problem with two classes, and then it would be [100, 2] instead).
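
Before involving ND4J at all, those two shapes can be sketched with plain nested arrays (my own illustration for the single-feature case; passing either array to `Nd4j.create(double[][])` would then yield a rank-2 tensor of the same shape):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java sketch of the [numExamples, numFeatures] and [numExamples, numLabels]
// layouts described above: one row per example, one column per feature/label.
public class ShapeDemo {
    public static double[][] toFeatureMatrix(Map<Double, Double> data) {
        double[][] features = new double[data.size()][1];  // shape [numExamples, 1]
        int i = 0;
        for (Double x : data.keySet()) {
            features[i++][0] = x;
        }
        return features;
    }

    public static double[][] toLabelMatrix(Map<Double, Double> data) {
        double[][] labels = new double[data.size()][1];    // shape [numExamples, 1]
        int i = 0;
        for (Double y : data.values()) {
            labels[i++][0] = y;
        }
        return labels;
    }

    public static void main(String[] args) {
        Map<Double, Double> data = new LinkedHashMap<>();
        data.put(12.0, 0.0);
        data.put(75.0, 1.0);
        System.out.println(toFeatureMatrix(data).length); // prints 2
    }
}
```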

As ND4J supports tensors of different ranks and shapes, and 1 is an especially ambiguous size in this regard, you’ve got to provide a shape along with your double array when turning it into a tensor.

As you’ve got boxed values on the input anyway, I’d probably transform the data something like this. But note that this is only really something you should do for toy problems; for real-world problems, this approach is too slow:

    DataSet mapToDataSet(Map<Double, Double> input){
        final INDArray features = Nd4j.create(input.size(), 1);
        final INDArray labels = Nd4j.create(input.size(), 1);
        int i = 0;
        for (final Map.Entry<Double, Double> entry : input.entrySet()) {
            features.put(i, 0, entry.getKey());
            labels.put(i, 0, entry.getValue());
            i++;
        }
        return new DataSet(features, labels);
    }

In the above example I’ve shown how to do it for the case in your initial post, but I think you can trivially see how to apply the same approach to the two-dimensional case from your follow-up post.

Thanks! One question: why would a problem with 2 classes have labels with shape [100, 2]? Are they one-hot encoded? Why not just have one dimension that contains a number indicating which class it is?

They are one-hot encoded because that allows you to use a multi-class cross-entropy loss function. That way you get a nice probability distribution as your output.
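
As a concrete illustration of what one-hot encoding means here (my own sketch in plain Java, not a dl4j API):

```java
// One-hot encoding: class index k out of numClasses becomes a row with a 1.0
// in position k and 0.0 everywhere else. A softmax output layer trained with
// multi-class cross entropy produces outputs in exactly this layout.
public class OneHot {
    public static double[] encode(int classIndex, int numClasses) {
        if (classIndex < 0 || classIndex >= numClasses) {
            throw new IllegalArgumentException("class index out of range");
        }
        double[] row = new double[numClasses];  // Java initializes to all zeros
        row[classIndex] = 1.0;
        return row;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(encode(1, 2))); // prints [0.0, 1.0]
    }
}
```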

Assume you have 10 classes and a single output that uses regression to calculate which class it should be. What does the output “5” tell you? That it is class 6? That the model isn’t quite certain whether it is class 1 or class 10? What does the output “-1” mean? What does the output “12” mean?

The only time that a single output works for classification is when you’ve got just two classes and you can constrain it to be between 0 and 1. In every other case it doesn’t make any sense.
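
For the two-class exception mentioned here, the usual way to constrain a single output to (0, 1) is a sigmoid; a minimal sketch (my own, not dl4j code):

```java
// Sigmoid squashes any real-valued output into (0, 1), which is why a single
// output can work for exactly two classes: the value reads directly as the
// probability of the positive class.
public class Squash {
    public static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    public static void main(String[] args) {
        System.out.println(sigmoid(0.0)); // prints 0.5
    }
}
```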

Even though it does work, having 2 outputs in that case still makes sense, as it gives you twice as much gradient information to work with.