Are results stored during training and inference beyond the subsequent next() call?

Hello everyone,

I wrote a DataSetIterator that uses an NDArray to provide the results with a next(int) call.
On the subsequent next() / next(int) call, I put new data in the same NDArray and basically overwrite all the values stored in the previous call.
I don’t do a .dup() call, so memory- and performance-wise this is great, but if the data is stored anywhere downstream, it could have pretty bad side effects.
I wanted to check with the community.
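To make the concern concrete, here is the aliasing hazard sketched with plain Java arrays instead of NDArrays (a minimal analogy, not ND4J code): if any consumer keeps a reference to the reused buffer instead of copying it, overwriting the buffer on the next call silently changes what the consumer sees.

```java
import java.util.ArrayList;
import java.util.List;

public class BufferReuseDemo {
    public static void main(String[] args) {
        float[] reused = new float[3];            // the one buffer the "iterator" keeps reusing
        List<float[]> keptByConsumer = new ArrayList<>();

        for (int batch = 0; batch < 2; batch++) {
            for (int i = 0; i < reused.length; i++) {
                reused[i] = batch * 10 + i;       // "next()" overwrites the previous values
            }
            keptByConsumer.add(reused);           // consumer stores the reference, not a copy
        }

        // both stored "batches" now alias the same array and show only the last batch
        System.out.println(keptByConsumer.get(0)[0]); // prints 10.0, not 0.0
        System.out.println(keptByConsumer.get(1)[0]); // prints 10.0
    }
}
```

A `.dup()` at the `add` site (the analogue of `allIn.dup()`) is exactly what breaks this aliasing, at the cost of one copy per batch.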

I saw NeuralNetConfiguration.Builder().cacheMode(CacheMode) - what’s cached there?
Do you see any problems with the design during training and inference?

Thanks in advance and best regards!

@sascha08-15 no, you would have to pre-save the NDArrays and iterate over the batches that way. You may want to look at this one:


Thanks for your answer. I understand the approach of storing it on the filesystem, but wouldn’t that slow down the iterator tremendously (even when wrapped in the async iterator), given disk vs. memory speeds?
To illustrate my case, it’s easiest if I share the next(int) implementation. The lines in question are these:

        var dataSet = new DataSet(allIn, allOut); //is this okay, as next() will not be stored in the NN beyond the subsequent call?
        var dataSet = new DataSet(allIn.dup(), allOut.dup()); //alternatively, one would have to dup() 

The implementation is:

public class SlidingWindowDataSetIteratorANN implements DataSetIterator {

    @Getter(onMethod_ = {@Override})
    @Setter(onMethod_ = {@Override})
    private DataSetPreProcessor preProcessor;

    final private long rows;
    final private long cols;
    final private long numInputFeatures;
    final private long numOutputFeatures;

    final private INDArray features;
    final private INDArray labels;

    private INDArray mask;
    final private int windowSize;
    final private int miniBatchSize;

    private Long cursor = 0L;

    final private DataType dataType;

    final INDArray allIn;
    final INDArray allOut;

    public SlidingWindowDataSetIteratorANN(@NonNull INDArray data, INDArray mask, int numOutputFeatures,
                                           int miniBatchSize, int windowSize) {

        this.rows = data.rows();          // inferred: these assignments were missing from the excerpt
        this.cols = data.columns();
        this.dataType = data.dataType();
        this.numOutputFeatures = numOutputFeatures;
        this.numInputFeatures = this.cols - numOutputFeatures;
        this.windowSize = windowSize;
        this.miniBatchSize = miniBatchSize;

        this.features = data.get(NDArrayIndex.all(), NDArrayIndex.interval(0, numInputFeatures))
                .reshape(rows * numInputFeatures);
        this.labels = data.get(NDArrayIndex.all(), NDArrayIndex.interval(numInputFeatures, cols))
                .reshape(rows * numOutputFeatures);
        this.mask = mask;
        // these two arrays are the output buffers reused by next(int)
        this.allIn = Nd4j.create(dataType, new long[]{miniBatchSize, numInputFeatures * windowSize}, 'c');
        this.allOut = Nd4j.create(dataType, new long[]{miniBatchSize, numOutputFeatures}, 'c');
    }

    public boolean asyncSupported() {
        return false;
    }

    public void reset() {
        cursor = 0L;
    }

    public DataSet next() {
        if (hasNext()) {
            return next(miniBatchSize);
        } else {
            throw new IndexOutOfBoundsException("Cursor is at " + cursor + "; got " + rows
                    + " rows; request of miniBatchSize " + miniBatchSize
                    + " with windowSize " + windowSize + " invalid");
        }
    }
    public DataSet next(int num) {

        if (num != miniBatchSize) {
            throw new RuntimeException("num must equal batchSize");
        }

        for (int i = 0; i < num; i++) {

            // the first createBuffer argument was lost in the original post;
            // features.data() / labels.data() are what fits the surrounding code
            long dataOffsetIn = (i + cursor) * numInputFeatures;
            DataBuffer buffIn = Nd4j.createBuffer(features.data(), dataOffsetIn, numInputFeatures * windowSize);
            INDArray in = Nd4j.create(buffIn, new long[]{windowSize * numInputFeatures});
            allIn.putRow(i, in);

            long dataOffsetOut = (i + cursor) * numOutputFeatures + windowSize - 1;
            DataBuffer buffOut = Nd4j.createBuffer(labels.data(), dataOffsetOut, numOutputFeatures);
            INDArray out = Nd4j.create(buffOut, new long[]{numOutputFeatures});
            allOut.putRow(i, out);
        }

        var dataSet = new DataSet(allIn, allOut); // is this okay, as next() will not be stored in the NN beyond the subsequent call?
        // var dataSet = new DataSet(allIn.dup(), allOut.dup()); // alternatively, one would have to dup()
        if (preProcessor != null && !dataSet.isPreProcessed()) {
            preProcessor.preProcess(dataSet);
        }
        cursor += num;
        return dataSet;
    }
}


With the AsyncDataSetIterator that usually wraps the iterator you pass to fit, this is a bad idea.

However, if you are using your own training loop, where you provide just the dataset object to fit, then it should work. As far as I know the model doesn’t hold on to any of the inputs.
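The prefetch hazard mentioned above can be sketched without DL4J (a plain-Java analogy, single-threaded for clarity): an async wrapper queues up batches ahead of time, so if the iterator reuses one buffer, queued batches alias each other and the training side reads overwritten data.

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class PrefetchHazardDemo {
    public static void main(String[] args) {
        float[] shared = new float[1];          // the single buffer the iterator reuses
        Queue<float[]> prefetch = new ArrayDeque<>();

        // the "async" side prefetches two batches ahead, but both entries alias `shared`
        shared[0] = 1f; prefetch.add(shared);   // batch 1
        shared[0] = 2f; prefetch.add(shared);   // batch 2 overwrites batch 1 in place

        // the training side now dequeues "batch 1" and already sees batch 2's data
        System.out.println(prefetch.poll()[0]); // prints 2.0, not 1.0
    }
}
```

With a synchronous hand-written loop there is no queue, so the model consumes each batch before the iterator overwrites it, which is why the reuse scheme can work there.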

The approach you’ve chosen for your data set iterator strikes me as odd. If you provide some more context, we may be able to point you towards a better solution.

Thanks for your answer, that helps! AsyncDataSetIterator caches, which could be disastrous when I overwrite the underlying DataBuffer with “put”. My question was more focused on the model and whether it stores any data (or pointer) that I would then overwrite in the subsequent next() call.

But to the general context, I am happy to share some details and get ideas how to improve what I want to do:
The iterator is derived from the RNN version of it and flattens n timesteps into a single feature row.

What I feed as data is a matrix of [ features | label ] with, let’s say, the past 1 million rows (~2 years of data: stock prices, indicators, etc.). I noticed a big memory issue; relying on the GC to clean that up has a very negative impact on performance.
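The flattening described above (a window of n timesteps with f features each becomes one row of n*f features) can be sketched in plain Java, independent of ND4J; the method name is made up for illustration:

```java
public class WindowFlattenDemo {
    /** Copies rows [start, start + window) of a row-major matrix into one flat feature row. */
    static float[] flattenWindow(float[][] data, int start, int window) {
        int features = data[0].length;
        float[] flat = new float[window * features];
        for (int t = 0; t < window; t++) {
            System.arraycopy(data[start + t], 0, flat, t * features, features);
        }
        return flat;
    }

    public static void main(String[] args) {
        // 4 timesteps, 2 features each
        float[][] data = {{1, 2}, {3, 4}, {5, 6}, {7, 8}};
        float[] row = flattenWindow(data, 1, 2);            // window of 2 timesteps starting at t=1
        System.out.println(java.util.Arrays.toString(row)); // [3.0, 4.0, 5.0, 6.0]
    }
}
```

The iterator above does the same thing, except it builds the flat row as a view over the contiguous underlying buffer instead of copying, which is where the overwrite question comes from.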

Below you find the RNN version of the iterator (just the relevant “next” part).

P.S.: What it compares to is the SequenceRecordReaderDataSetIterator from DataVec. But again: performance was the reason why I derived my own iterator.

 public DataSet next(int num) {

      if (num != miniBatchSize) {
         throw new RuntimeException("num must equal batchSize");
      }

      var countDownLatch = new CountDownLatch(num); // note: unused in this excerpt

      for (int i = 0; i < num; i++) {
               // the first createBuffer arguments were lost in the original post;
               // data.data() and mask.data() are what fits the surrounding code
               long dataOffset = (i + cursor) * cols;
               DataBuffer buff = Nd4j.createBuffer(data.data(), dataOffset, cols * windowSize);
               dataOffset = i + cursor; // masks just have 1 column
               DataBuffer maskBuffIn = Nd4j.createBuffer(mask.data(), dataOffset, windowSize);
               DataBuffer maskBuffOut = Nd4j.createBuffer(mask.data(), dataOffset, windowSize);

               // iteration contains features as well as labels
               INDArray iteration = Nd4j.create(buff, new long[]{windowSize, cols});
               INDArray maskIterationIn = Nd4j.create(maskBuffIn, new long[]{windowSize});
               INDArray maskIterationOut = Nd4j.create(maskBuffOut, new long[]{windowSize});

               INDArray in = iteration.get(NDArrayIndex.all(), NDArrayIndex.interval(0, numInputFeatures));
               allIn.tensorAlongDimension(i, 1, 2).permutei(1, 0).assign(in);
               if (num == 1) {
                  allMaskIn.tensorAlongDimension(i, 1).assign(maskIterationIn);
               } else {
                  allMaskIn.tensorAlongDimension(i, 1).permutei(0).assign(maskIterationIn);
               }

               INDArray out = iteration.get(NDArrayIndex.all(), NDArrayIndex.interval(numInputFeatures, cols));
               allOut.tensorAlongDimension(i, 1, 2).permutei(1, 0).assign(out);

               if (num == 1) {
                  allMaskOut.tensorAlongDimension(i, 1).assign(maskIterationOut);
               } else {
                  allMaskOut.tensorAlongDimension(i, 1).permutei(0).assign(maskIterationOut);
               }
      }

      DataSet dataSet = new DataSet(allIn, allOut, allMaskIn, allMaskOut);

      cursor += num; // moves on to the next time window
      return dataSet;
 }

In that case I would probably first try to see how much of a difference explicitly closing the NDArray instances would make.

I would probably handle that in the same way I handled fast access to large word vector files: via mmap. A long time ago I wrote a draft implementation that never made it into DL4J (the draft is on GitHub).

If your file ends up being less than 2GB in size, it can be quite simple overall.

The nice thing about mmap is that you let the operating system handle all of the memory management. It reads data in a page at a time, so in most cases access is quite fast, and if I recall correctly it will even do read-ahead.
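A minimal sketch of the mmap idea using only the JDK (the file contents here are made up for illustration): write floats to a file once, then map it read-only and index into it without loading the whole thing onto the heap. As noted above, a file under 2 GB fits a single mapping.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.FloatBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapDemo {
    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("vectors", ".bin");
        file.toFile().deleteOnExit();

        // write a small float array as raw bytes (stand-in for a pre-saved dataset)
        ByteBuffer out = ByteBuffer.allocate(4 * Float.BYTES);
        for (float f : new float[]{0.5f, 1.5f, 2.5f, 3.5f}) out.putFloat(f);
        Files.write(file, out.array());

        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // the OS pages the data in on demand; random access is just an index computation
            FloatBuffer floats = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size())
                                   .asFloatBuffer();
            System.out.println(floats.get(2)); // prints 2.5
        }
    }
}
```

A sliding-window iterator on top of this would compute `(cursor + i) * cols` as a float index into the mapped buffer, much like the DataBuffer offsets in the code above.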