Are DataSetIterator.next() results stored during training and inference beyond the subsequent next() call?

Hello everyone,

I wrote a DataSetIterator that uses a single NDArray to provide the results of each next(int) call.
On the subsequent next() / next(int) call, I put new data into the same NDArray and basically overwrite all the values stored during the previous call.
I don't call .dup(), so memory- and performance-wise this is great, but depending on whether the data is stored somewhere, it could have pretty bad side effects.
I wanted to check with the community.
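
To make the concern concrete, here is a tiny stand-alone sketch (plain ND4J; the array and values are made up just for illustration) of what happens when a DataSet only references a reused array:

```java
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.dataset.DataSet;
import org.nd4j.linalg.factory.Nd4j;

public class AliasingSketch {
    public static void main(String[] args) {
        INDArray reused = Nd4j.zeros(2, 3);            // backing array reused across next() calls
        DataSet first = new DataSet(reused, reused);   // no dup(): the DataSet only keeps a reference
        reused.assign(42.0);                           // the "next" call overwrites the same memory
        System.out.println(first.getFeatures());       // prints 42s: the earlier DataSet changed too
    }
}
```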

I saw NeuralNetConfiguration.Builder().cacheMode(CacheMode) - what’s cached there?
Do you see any problems with the design during training and inference?

Thanks in advance and best regards!

@sascha08-15 No, you would have to pre-save the NDArrays and iterate over the batches that way. You may want to look at this one:

This will use DataSet.save/load.
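
Roughly, the pre-save approach looks like this (a sketch only; the directory, file naming, and the sourceIter/model names are placeholders, not from this thread; DataSet.save/load are the relevant calls):

```java
import java.io.File;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.nd4j.linalg.dataset.DataSet;
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;

public class PreSaveSketch {
    // Save every mini-batch to disk once...
    static int preSave(DataSetIterator sourceIter, File saveDir) {
        int batchIdx = 0;
        while (sourceIter.hasNext()) {
            sourceIter.next().save(new File(saveDir, "batch-" + batchIdx++ + ".bin"));
        }
        return batchIdx;
    }

    // ...then train from the saved files, loading one batch at a time.
    static void fitFromDisk(MultiLayerNetwork model, File saveDir, int numBatches) {
        for (int i = 0; i < numBatches; i++) {
            DataSet ds = new DataSet();
            ds.load(new File(saveDir, "batch-" + i + ".bin"));
            model.fit(ds);
        }
    }
}
```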

Thanks for your answer. I understand the approach of storing the data on the filesystem, but wouldn't that slow down the iterator tremendously (even when wrapped in an AsyncDataSetIterator), given disk vs. memory speeds?
To illustrate my case, it's easiest if I share the next(int) implementation. The lines in question are these:

        var dataSet = new DataSet(allIn, allOut); // is this okay, given that next() results are not kept by the network beyond the subsequent call?
        // var dataSet = new DataSet(allIn.dup(), allOut.dup()); // alternative: defensive copy via dup()

The implementation is:


public class SlidingWindowDataSetIteratorANN implements DataSetIterator {

    @Getter(onMethod_ = {@Override})
    @Setter(onMethod_ = {@Override})
    private DataSetPreProcessor preProcessor;

    final private long rows;
    final private long cols;
    @Getter
    final private long numInputFeatures;
    @Getter
    final private long numOutputFeatures;

    final private INDArray features;
    final private INDArray labels;

    private INDArray mask;
    final private int windowSize;
    final private int miniBatchSize;

    @Getter
    private Long cursor = 0L;

    final private DataType dataType;

    final INDArray allIn;
    final INDArray allOut;

    public SlidingWindowDataSetIteratorANN(@NonNull INDArray data, INDArray mask, int numOutputFeatures,
                                           int miniBatchSize, int windowSize) {


        // dimensions and data type are taken from the input matrix
        this.rows = data.rows();
        this.cols = data.columns();
        this.dataType = data.dataType();

        this.numOutputFeatures = numOutputFeatures;
        this.numInputFeatures = this.cols - numOutputFeatures;
        this.windowSize = windowSize;
        this.miniBatchSize = miniBatchSize;

        this.features = data.get(NDArrayIndex.all(), NDArrayIndex.interval(0, numInputFeatures))
                .reshape(rows * numInputFeatures)
                .dup();
        this.labels = data.get(NDArrayIndex.all(), NDArrayIndex.interval(numInputFeatures, cols))
                .reshape(rows * numOutputFeatures)
                .dup();
        this.mask = mask;
        // these two arrays are the reused output arrays for next(int)
        this.allIn = Nd4j.create(dataType, new long[]{miniBatchSize, numInputFeatures * windowSize}, 'c');
        this.allOut = Nd4j.create(dataType, new long[]{miniBatchSize, numOutputFeatures}, 'c');

    }


    @Override
    public boolean asyncSupported() {
        return false;
    }

    @Override
    public void reset() {
        cursor = 0L;
    }

    @Override
    public DataSet next() {
        if (hasNext()) {
            return next(miniBatchSize);
        } else
            throw new IndexOutOfBoundsException("Cursor is at " + cursor + "; got " + rows + " rows; request of miniBatchSize " + miniBatchSize + " with windowSize " + windowSize + " invalid");
    }


    @Override
    public DataSet next(int num) {

        if (num != miniBatchSize) {
            throw new RuntimeException("num must equal batchSize");
        }

        for (int i = 0; i < num; i++) {

            long dataOffsetIn = (i + cursor) * numInputFeatures;
            DataBuffer buffIn = Nd4j.createBuffer(features.data(), dataOffsetIn, numInputFeatures * windowSize);
            INDArray in = Nd4j.create(buffIn, new long[]{windowSize * numInputFeatures});
            allIn.putRow(i, in);

            long dataOffsetOut = (i + cursor) * numOutputFeatures + windowSize - 1;
            DataBuffer buffOut = Nd4j.createBuffer(labels.data(), dataOffsetOut, numOutputFeatures);
            INDArray out = Nd4j.create(buffOut, new long[]{numOutputFeatures});
            allOut.putRow(i, out);

        }

        var dataSet = new DataSet(allIn, allOut); // is this okay, given that next() results are not kept by the network beyond the subsequent call?
        // var dataSet = new DataSet(allIn.dup(), allOut.dup()); // alternative: defensive copy via dup()
        if (preProcessor != null) {
            if (!dataSet.isPreProcessed()) {
                preProcessor.preProcess(dataSet);
                dataSet.markAsPreProcessed();
            }
        }
        cursor += num;
        return dataSet;
    }

}

Thanks!

With the AsyncDataSetIterator that usually wraps the iterator you pass to fit, this is a bad idea.

However, if you are using your own training loop, where you pass just the DataSet object to fit, it should work. As far as I know, the model doesn't hold on to any of the inputs.
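
For illustration, a minimal sketch of such a manual loop (method and parameter names are made up; the point is that each batch goes straight into fit(DataSet), with no AsyncDataSetIterator in between):

```java
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;

public class ManualTrainingLoop {
    static void train(MultiLayerNetwork model, DataSetIterator iterator, int numEpochs) {
        for (int epoch = 0; epoch < numEpochs; epoch++) {
            iterator.reset();
            while (iterator.hasNext()) {
                // fit(DataSet) consumes the batch immediately, so overwriting the
                // backing array on the next iteration should be safe here
                model.fit(iterator.next());
            }
        }
    }
}
```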

The approach you’ve chosen for your data set iterator strikes me as odd. If you provide some more context, we may be able to point you towards a better solution.

Thanks for your answer, that helps! AsyncDataSetIterator caches batches, which could be disastrous when I overwrite the underlying DataBuffer with put. My question was more focused on the model and whether it stores any data (or pointer) that I would then overwrite on the subsequent next() call.

As for the general context, I am happy to share some details and get ideas on how to improve what I want to do:
The iterator is derived from the RNN version of it and flattens n time steps into the feature vector.

The data I feed in is a matrix of [ features | label ] covering the past, say, 1 million rows (~2 years of data: stock prices, indicators, etc.). I did notice a big memory issue; relying on the GC to clean that up has a very negative impact on performance.

Below is the RNN version of the iterator (just the relevant next(int) part).

P.S.: The closest comparable class is the SequenceRecordReaderDataSetIterator from DataVec. But again: performance was the reason why I wrote my own iterator.

   @Override
   public DataSet next(int num) {

      if (num != miniBatchSize) {
         throw new RuntimeException("num must equal batchSize");
      }

      for (int i = 0; i < num; i++) {
         long dataOffset = (i + cursor) * cols;
         DataBuffer buff = Nd4j.createBuffer(data.data(), dataOffset, cols * windowSize);
         dataOffset = i + cursor; // masks just have 1 column
         DataBuffer maskBuffIn = Nd4j.createBuffer(maskIn.data(), dataOffset, windowSize);
         DataBuffer maskBuffOut = Nd4j.createBuffer(maskOut.data(), dataOffset, windowSize);

         // iteration contains features as well as labels
         INDArray iteration = Nd4j.create(buff, new long[]{windowSize, cols});
         INDArray maskIterationIn = Nd4j.create(maskBuffIn, new long[]{windowSize});
         INDArray maskIterationOut = Nd4j.create(maskBuffOut, new long[]{windowSize});

         INDArray in = iteration.get(NDArrayIndex.all(), NDArrayIndex.interval(0, numInputFeatures));
         in.setOrder('f');
         allIn.tensorAlongDimension(i, 1, 2).permutei(1, 0).assign(in);
         if (num == 1) {
            allMaskIn.tensorAlongDimension(i, 1).assign(maskIterationIn);
         } else {
            allMaskIn.tensorAlongDimension(i, 1).permutei(0).assign(maskIterationIn);
         }

         INDArray out = iteration.get(NDArrayIndex.all(), NDArrayIndex.interval(numInputFeatures, cols));
         out.setOrder('f');
         allOut.tensorAlongDimension(i, 1, 2).permutei(1, 0).assign(out);

         if (num == 1) {
            allMaskOut.tensorAlongDimension(i, 1).assign(maskIterationOut);
         } else {
            allMaskOut.tensorAlongDimension(i, 1).permutei(0).assign(maskIterationOut);
         }
      }

      DataSet dataSet = new DataSet(allIn, allOut, allMaskIn, allMaskOut);

      cursor += num; // moves on to the next time window
      return dataSet;
   }

In that case I would probably first try to see how much of a difference explicitly closing the NDArray instances would make.
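
A minimal sketch of what explicit closing looks like, assuming a recent ND4J version where INDArray implements AutoCloseable (the array here is just an illustrative temporary, not one of the view arrays from your iterator):

```java
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

public class ExplicitCloseSketch {
    public static void main(String[] args) throws Exception {
        // try-with-resources releases the off-heap buffer deterministically
        try (INDArray temp = Nd4j.rand(1000, 1000)) {
            System.out.println(temp.meanNumber());
        } // temp's memory is freed here instead of waiting for the garbage collector

        // or, without try-with-resources:
        INDArray temp2 = Nd4j.rand(1000, 1000);
        System.out.println(temp2.sumNumber());
        temp2.close();
    }
}
```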

I would probably handle that in the same way I handled fast access to large word vector files: via mmap. A long time ago I wrote a draft for an implementation that never made it into DL4J: (BinaryWordVectorSerializer.java · GitHub)

If your file ends up being less than 2GB in size, it can be quite simple overall.

The nice thing about mmap is that you let the operating system handle all of the memory management. It will automatically page data in as you access it, so in most cases access is quite fast. And if I recall correctly, it will even do read-ahead.
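
As a rough illustration of the plain-Java side of that (the file name and layout, a flat array of float32 values, are made up), a memory-mapped read looks roughly like this:

```java
import java.io.IOException;
import java.nio.FloatBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MmapSketch {
    public static void main(String[] args) throws IOException {
        try (FileChannel channel = FileChannel.open(Paths.get("features.bin"), StandardOpenOption.READ)) {
            // Map the whole file; this simple form works as long as it stays under 2 GB,
            // because MappedByteBuffer is indexed with an int.
            FloatBuffer data = channel
                    .map(FileChannel.MapMode.READ_ONLY, 0, channel.size())
                    .asFloatBuffer();

            // Random access into the file; the OS pages data in (and reads ahead) as needed.
            System.out.println(data.get(0));
        }
    }
}
```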