Data slicing failed

Why do some slices succeed while others fail?

INDArray data = Nd4j.rand(17379, 59);
int casualIndex = 56;
int cntIndex = 58;
INDArray features = data.get(NDArrayIndex.all(), NDArrayIndex.interval(0, casualIndex));
INDArray targets = data.get(NDArrayIndex.all(), NDArrayIndex.interval(casualIndex, cntIndex + 1));

int total_row = (int) data.shape()[0];
int val_size = total_row - 21 * 24;
int train_size = val_size - 60 * 24;
INDArray train_features = features.get(NDArrayIndex.interval(0, train_size), NDArrayIndex.all());
INDArray train_targets = targets.get(NDArrayIndex.interval(0, train_size), NDArrayIndex.all());
INDArray val_features = features.get(NDArrayIndex.interval(train_size, val_size), NDArrayIndex.all());
INDArray val_targets = targets.get(NDArrayIndex.interval(train_size, val_size), NDArrayIndex.all());
INDArray test_features = features.get(NDArrayIndex.interval(val_size, total_row), NDArrayIndex.all());
INDArray test_targets = targets.get(NDArrayIndex.interval(val_size, total_row), NDArrayIndex.all());

System.out.printf("train_features shape:%s\n", Arrays.toString(train_features.shape()));
System.out.printf("train_targets shape:%s\n", Arrays.toString(train_targets.shape()));
System.out.printf("val_features shape:%s\n", Arrays.toString(val_features.shape()));
System.out.printf("val_targets shape:%s\n", Arrays.toString(val_targets.shape()));
System.out.printf("test_features shape:%s\n", Arrays.toString(test_features.shape()));
System.out.printf("test_targets shape:%s\n", Arrays.toString(test_targets.shape()));

Output:

train_features shape:[15435, 56]
train_targets shape:[15435, 3]
val_features shape:[1440, 56]
val_targets shape:[ ]
test_features shape:[ ]
test_targets shape:[ ]

@luuu could you clarify what you expect the results to be for each one and if you got any exceptions? Thanks! Otherwise I have to guess what you want the result to be. I might be able to help reword your code to give the correct result.

@agibsonccc The original data is first split by column into features and labels. The features and the labels are then each split by row into training, validation, and test sets.

@luuu just to clarify, I’m purely talking about the shapes of the data. Try to give me something self-contained that I can run with the shapes predefined. Start from the original dataset with an expected full shape (say 1000 rows by however many columns) and then give me the indexing calls you’re trying to make.
Try to make it so I don’t have to spend time reverse engineering your specific problem context.

@agibsonccc
Calling dup() before slicing gives the correct result, but it is not a great way to do it.

INDArray data = Nd4j.rand(17379, 59);
int casualIndex = 56;
int cntIndex = 58;
INDArray features = data.get(NDArrayIndex.all(), NDArrayIndex.interval(0, casualIndex));
INDArray targets = data.get(NDArrayIndex.all(), NDArrayIndex.interval(casualIndex, cntIndex + 1));

int total_row = (int) data.shape()[0];
int val_size = total_row - 21 * 24;
int train_size = val_size - 60 * 24;
INDArray train_features = features.get(NDArrayIndex.interval(0, train_size), NDArrayIndex.all());
INDArray train_targets = targets.get(NDArrayIndex.interval(0, train_size), NDArrayIndex.all());
INDArray val_features = features.get(NDArrayIndex.interval(train_size, val_size), NDArrayIndex.all());
INDArray val_targets = targets.dup().get(NDArrayIndex.interval(train_size, val_size), NDArrayIndex.all());
INDArray test_features = features.dup().get(NDArrayIndex.interval(val_size, total_row), NDArrayIndex.all());
INDArray test_targets = targets.dup().get(NDArrayIndex.interval(val_size, total_row), NDArrayIndex.all());

System.out.printf("train_features shape:%s\n", Arrays.toString(train_features.shape()));
System.out.printf("train_targets shape:%s\n", Arrays.toString(train_targets.shape()));
System.out.printf("val_features shape:%s\n", Arrays.toString(val_features.shape()));
System.out.printf("val_targets shape:%s\n", Arrays.toString(val_targets.shape()));
System.out.printf("test_features shape:%s\n", Arrays.toString(test_features.shape()));
System.out.printf("test_targets shape:%s\n", Arrays.toString(test_targets.shape()));

I expect the slices to have the following shapes:
train_features shape:[15435, 56]
train_targets shape:[15435, 3]
val_features shape:[1440, 56]
val_targets shape:[1440, 3]
test_features shape:[504, 56]
test_targets shape:[504, 3]
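
The expected shapes above follow from the index arithmetic in the snippet alone. As a minimal plain-Java sketch (no ND4J involved; the class SplitShapes and its method rowCounts are hypothetical names introduced only for this illustration), the row counts can be recomputed like this:

```java
import java.util.Arrays;

// Hypothetical helper, not part of ND4J: recomputes the expected split sizes.
class SplitShapes {
    static final int TEST_ROWS = 21 * 24; // 504 rows at the end, used for testing
    static final int VAL_ROWS = 60 * 24;  // 1440 rows before those, used for validation

    /** Returns {trainRows, valRows, testRows} for a dataset with totalRows rows. */
    static int[] rowCounts(int totalRows) {
        int valEnd = totalRows - TEST_ROWS; // called val_size in the snippet
        int trainEnd = valEnd - VAL_ROWS;   // called train_size in the snippet
        return new int[] {trainEnd, valEnd - trainEnd, totalRows - valEnd};
    }

    public static void main(String[] args) {
        int featureCols = 56;         // width of interval(0, casualIndex)
        int targetCols = 58 + 1 - 56; // width of interval(casualIndex, cntIndex + 1)
        int[] rows = rowCounts(17379);
        String[] names = {"train", "val", "test"};
        for (int i = 0; i < rows.length; i++) {
            System.out.printf("%s_features shape:%s%n", names[i],
                    Arrays.toString(new int[] {rows[i], featureCols}));
            System.out.printf("%s_targets shape:%s%n", names[i],
                    Arrays.toString(new int[] {rows[i], targetCols}));
        }
    }
}
```

This only checks the arithmetic. It says nothing about how ND4J computes the shape of a view.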

@luuu almost there. Yeah dup() should not be a requirement. I mainly want to make sure that I have something reproducible. Thanks for working with me here. Let me run your code in a bit and I’ll try to troubleshoot this.

Just to check you’re using M2.1 right?

@agibsonccc Yes, the version I use is 1.0.0-M2.1.

@luuu awesome thanks. Let me check if this is already fixed on the latest version first.

@agibsonccc
I should add some more information. There are two cases below: with a relatively small row count, only a few of the slices fail; with a relatively large row count, most of the slices fail.

Relatively small row count:

INDArray data = Nd4j.rand(17379 / 3, 59);
int casualIndex = 56;
int cntIndex = 58;
INDArray features = data.get(NDArrayIndex.all(), NDArrayIndex.interval(0, casualIndex));
INDArray targets = data.get(NDArrayIndex.all(), NDArrayIndex.interval(casualIndex, cntIndex + 1));


int total_row = (int) data.shape()[0];
int val_size = total_row - 21 * 24;
int train_size = val_size - 60 * 24;
INDArray train_features = features.get(NDArrayIndex.interval(0, train_size), NDArrayIndex.all());
INDArray train_targets = targets.get(NDArrayIndex.interval(0, train_size), NDArrayIndex.all());
INDArray val_features = features.get(NDArrayIndex.interval(train_size, val_size), NDArrayIndex.all());
INDArray val_targets = targets.get(NDArrayIndex.interval(train_size, val_size), NDArrayIndex.all());
INDArray test_features = features.get(NDArrayIndex.interval(val_size, total_row), NDArrayIndex.all());
INDArray test_targets = targets.get(NDArrayIndex.interval(val_size, total_row), NDArrayIndex.all());

System.out.printf("train_features shape:%s\n", Arrays.toString(train_features.shape()));
System.out.printf("train_targets shape:%s\n", Arrays.toString(train_targets.shape()));
System.out.printf("val_features shape:%s\n", Arrays.toString(val_features.shape()));
System.out.printf("val_targets shape:%s\n", Arrays.toString(val_targets.shape()));
System.out.printf("test_features shape:%s\n", Arrays.toString(test_features.shape()));
System.out.printf("test_targets shape:%s\n", Arrays.toString(test_targets.shape()));

Output:
train_features shape:[3849, 56]
train_targets shape:[3849, 3]
val_features shape:[1440, 56]
val_targets shape:[ ]
test_features shape:[504, 56]
test_targets shape:[ ]

Relatively large row count:

INDArray data = Nd4j.rand(17379 * 3, 59);
int casualIndex = 56;
int cntIndex = 58;
INDArray features = data.get(NDArrayIndex.all(), NDArrayIndex.interval(0, casualIndex));
INDArray targets = data.get(NDArrayIndex.all(), NDArrayIndex.interval(casualIndex, cntIndex + 1));


int total_row = (int) data.shape()[0];
int val_size = total_row - 21 * 24;
int train_size = val_size - 60 * 24;
INDArray train_features = features.get(NDArrayIndex.interval(0, train_size), NDArrayIndex.all());
INDArray train_targets = targets.get(NDArrayIndex.interval(0, train_size), NDArrayIndex.all());
INDArray val_features = features.get(NDArrayIndex.interval(train_size, val_size), NDArrayIndex.all());
INDArray val_targets = targets.get(NDArrayIndex.interval(train_size, val_size), NDArrayIndex.all());
INDArray test_features = features.get(NDArrayIndex.interval(val_size, total_row), NDArrayIndex.all());
INDArray test_targets = targets.get(NDArrayIndex.interval(val_size, total_row), NDArrayIndex.all());

System.out.printf("train_features shape:%s\n", Arrays.toString(train_features.shape()));
System.out.printf("train_targets shape:%s\n", Arrays.toString(train_targets.shape()));
System.out.printf("val_features shape:%s\n", Arrays.toString(val_features.shape()));
System.out.printf("val_targets shape:%s\n", Arrays.toString(val_targets.shape()));
System.out.printf("test_features shape:%s\n", Arrays.toString(test_features.shape()));
System.out.printf("test_targets shape:%s\n", Arrays.toString(test_targets.shape()));

Output:
train_features shape:[50193, 56]
train_targets shape:[50193, 3]
val_features shape:[ ]
val_targets shape:[ ]
test_features shape:[ ]
test_targets shape:[ ]

If I slice the raw data directly, selecting rows and columns in a single get() call, all slices succeed even with the larger row count.

INDArray data = Nd4j.rand(17379 * 3, 59);
int casualIndex = 56;
int cntIndex = 58;
int total_row = (int) data.shape()[0];
int val_size = total_row - 21 * 24;
int train_size = val_size - 60 * 24;
INDArray train_features = data.get(NDArrayIndex.interval(0, train_size), NDArrayIndex.interval(0, casualIndex));
INDArray train_targets = data.get(NDArrayIndex.interval(0, train_size), NDArrayIndex.interval(casualIndex, cntIndex + 1));
INDArray val_features = data.get(NDArrayIndex.interval(train_size, val_size), NDArrayIndex.interval(0, casualIndex));
INDArray val_targets = data.get(NDArrayIndex.interval(train_size, val_size), NDArrayIndex.interval(casualIndex, cntIndex + 1));
INDArray test_features = data.get(NDArrayIndex.interval(val_size, total_row), NDArrayIndex.interval(0, casualIndex));
INDArray test_targets = data.get(NDArrayIndex.interval(val_size, total_row), NDArrayIndex.interval(casualIndex, cntIndex + 1));
System.out.printf("train_features shape:%s\n", Arrays.toString(train_features.shape()));
System.out.printf("train_targets shape:%s\n", Arrays.toString(train_targets.shape()));
System.out.printf("val_features shape:%s\n", Arrays.toString(val_features.shape()));
System.out.printf("val_targets shape:%s\n", Arrays.toString(val_targets.shape()));
System.out.printf("test_features shape:%s\n", Arrays.toString(test_features.shape()));
System.out.printf("test_targets shape:%s\n", Arrays.toString(test_targets.shape()));

Output:
train_features shape:[50193, 56]
train_targets shape:[50193, 3]
val_features shape:[1440, 56]
val_targets shape:[1440, 3]
test_features shape:[504, 56]
test_targets shape:[504, 3]
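
For reference, slicing the columns first and then the rows should select exactly the same elements as slicing both at once. The sketch below states that expectation on plain Java arrays. It is not ND4J code and copies data instead of creating views, so it does not reproduce or explain the behaviour above; SliceReference, slice and shape are hypothetical names used only here.

```java
import java.util.Arrays;

// Plain-Java reference, not ND4J: every slice is a copy, never a view.
class SliceReference {
    /** Copies rows [r0, r1) and columns [c0, c1) out of a row-major matrix. */
    static double[][] slice(double[][] m, int r0, int r1, int c0, int c1) {
        double[][] out = new double[r1 - r0][];
        for (int r = r0; r < r1; r++) {
            out[r - r0] = Arrays.copyOfRange(m[r], c0, c1);
        }
        return out;
    }

    /** Returns {rows, columns} of a rectangular matrix. */
    static int[] shape(double[][] m) {
        return new int[] {m.length, m.length == 0 ? 0 : m[0].length};
    }

    public static void main(String[] args) {
        int totalRows = 17379 * 3, cols = 59;
        double[][] data = new double[totalRows][cols];
        for (int r = 0; r < totalRows; r++) {
            for (int c = 0; c < cols; c++) {
                data[r][c] = (double) r * cols + c; // unique value per cell
            }
        }
        int valEnd = totalRows - 21 * 24;
        int trainEnd = valEnd - 60 * 24;

        // Two steps: take the target columns first, then the validation rows.
        double[][] targets = slice(data, 0, totalRows, 56, 59);
        double[][] twoStep = slice(targets, trainEnd, valEnd, 0, 3);

        // One step: take rows and columns from the raw data at once.
        double[][] oneStep = slice(data, trainEnd, valEnd, 56, 59);

        System.out.println("two-step shape: " + Arrays.toString(shape(twoStep)));
        System.out.println("one-step shape: " + Arrays.toString(shape(oneStep)));
        System.out.println("same contents: " + Arrays.deepEquals(twoStep, oneStep));
    }
}
```

Comparing ND4J's two-step result against a copy-based reference like this one is one way to confirm which of the two paths returns the wrong shape.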

@luuu I took the time to run everything today locally and it appears to work (I checked with numpy indexing). Could you try this and let me know?

@agibsonccc I tested it with the snapshot version. Slicing an array that is itself the result of a slice still goes wrong: the result has one incorrect dimension. The test code and its output are as follows:

package org.example;

import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;
import org.nd4j.linalg.indexing.NDArrayIndex;

import java.util.Arrays;

public class ND4JSlice {

    public static void main(String[] args) {
        System.out.println("\nslice from small rows");
        test0();
        System.out.println("\nslice from large rows");
        test1();
        System.out.println("\nslice from raw rows");
        test2();
    }

    private static void test0() {
        test_slice(Nd4j.rand(17379 / 3, 59));
    }

    private static void test1() {
        test_slice(Nd4j.rand(17379 * 3, 59));
    }

    private static void test_slice(INDArray data) {
        int casualIndex = 56;
        int cntIndex = 58;
        INDArray features = data.get(NDArrayIndex.all(), NDArrayIndex.interval(0, casualIndex));
        INDArray targets = data.get(NDArrayIndex.all(), NDArrayIndex.interval(casualIndex, cntIndex + 1));

        int total_row = (int) data.shape()[0];
        int val_size = total_row - 21 * 24;
        int train_size = val_size - 60 * 24;
        INDArray train_features = features.get(NDArrayIndex.interval(0, train_size), NDArrayIndex.all());
        INDArray train_targets = targets.get(NDArrayIndex.interval(0, train_size), NDArrayIndex.all());
        INDArray val_features = features.get(NDArrayIndex.interval(train_size, val_size), NDArrayIndex.all());
        INDArray val_targets = targets.get(NDArrayIndex.interval(train_size, val_size), NDArrayIndex.all());
        INDArray test_features = features.get(NDArrayIndex.interval(val_size, total_row), NDArrayIndex.all());
        INDArray test_targets = targets.get(NDArrayIndex.interval(val_size, total_row), NDArrayIndex.all());


        System.out.printf("train_features shape:%s\n", Arrays.toString(train_features.shape()));
        System.out.printf("train_targets shape:%s\n", Arrays.toString(train_targets.shape()));
        System.out.printf("val_features shape:%s\n", Arrays.toString(val_features.shape()));
        System.out.printf("val_targets shape:%s\n", Arrays.toString(val_targets.shape()));
        System.out.printf("test_features shape:%s\n", Arrays.toString(test_features.shape()));
        System.out.printf("test_targets shape:%s\n", Arrays.toString(test_targets.shape()));
    }

    private static void test2() {
        INDArray data = Nd4j.rand(17379 * 3, 59);
        int casualIndex = 56;
        int cntIndex = 58;
        int total_row = (int) data.shape()[0];
        int val_size = total_row - 21 * 24;
        int train_size = val_size - 60 * 24;

        INDArray train_features = data.get(NDArrayIndex.interval(0, train_size), NDArrayIndex.interval(0, casualIndex));
        INDArray train_targets = data.get(NDArrayIndex.interval(0, train_size), NDArrayIndex.interval(casualIndex, cntIndex + 1));
        INDArray val_features = data.get(NDArrayIndex.interval(train_size, val_size), NDArrayIndex.interval(0, casualIndex));
        INDArray val_targets = data.get(NDArrayIndex.interval(train_size, val_size), NDArrayIndex.interval(casualIndex, cntIndex + 1));
        INDArray test_features = data.get(NDArrayIndex.interval(val_size, total_row), NDArrayIndex.interval(0, casualIndex));
        INDArray test_targets = data.get(NDArrayIndex.interval(val_size, total_row), NDArrayIndex.interval(casualIndex, cntIndex + 1));

        System.out.printf("train_features shape:%s\n", Arrays.toString(train_features.shape()));
        System.out.printf("train_targets shape:%s\n", Arrays.toString(train_targets.shape()));
        System.out.printf("val_features shape:%s\n", Arrays.toString(val_features.shape()));
        System.out.printf("val_targets shape:%s\n", Arrays.toString(val_targets.shape()));
        System.out.printf("test_features shape:%s\n", Arrays.toString(test_features.shape()));
        System.out.printf("test_targets shape:%s\n", Arrays.toString(test_targets.shape()));
    }

}

Output:
slice from small rows
train_features shape:[3849, 56]
train_targets shape:[3849, 3]
val_features shape:[1440, 56]
val_targets shape:[1440, 0]
test_features shape:[504, 56]
test_targets shape:[504, 0]

slice from large rows
train_features shape:[50193, 56]
train_targets shape:[50193, 3]
val_features shape:[1440, 0]
val_targets shape:[1440, 0]
test_features shape:[504, 0]
test_targets shape:[504, 0]

slice from raw rows
train_features shape:[50193, 56]
train_targets shape:[50193, 3]
val_features shape:[1440, 56]
val_targets shape:[1440, 3]
test_features shape:[504, 56]
test_targets shape:[504, 3]

Process finished with exit code 0

@luuu hmm…let me confirm and maybe update the snapshots. I’ll do that tomorrow. Is a CPU build fine for now?

@luuu sorry, I will publish a build tomorrow. Still in the middle of some other customer-related work.