Label and features in the same file for time series data (SequenceRecordReaderDataSetIterator)

The SequenceRecordReaderDataSetIterator has a constructor where the features and the labels come from the same reader:

/** Constructor where features and labels come from the SAME RecordReader (i.e., target/label is a column in the
     * same data as the features). Defaults to regression = false - i.e., for classification
     * @param reader SequenceRecordReader with data
     * @param miniBatchSize size of each minibatch
     * @param numPossibleLabels number of labels/classes for classification
     * @param labelIndex index in input of the label index. If in regression mode and numPossibleLabels > 1, labelIndex denotes the
     *                   first index for labels. Everything before that index will be treated as input(s) and
     *                   everything from that index (inclusive) to the end will be treated as output(s)
     */
    public SequenceRecordReaderDataSetIterator(SequenceRecordReader reader, int miniBatchSize, int numPossibleLabels,
                    int labelIndex) {
        this(reader, miniBatchSize, numPossibleLabels, labelIndex, false);
    }

I'm a bit confused about how this data should be shaped.
Does that mean the label should be repeated for every time step? That is, would each step in the time series record have the same value in the label column?
t|features|label
1|FFFFF|L
2|FFFFF|L
3|FFFFF|L

Or is there some other shape that I'm not seeing? Perhaps the label should be formatted as the very last time step?
1FFFFF
2FFFFF
3FFFFF
L

@absurd Take a look at a sample problem:

Here is the associated test-files repo:

Input Data:

Labels:

It basically boils down to one label per step. Hopefully that helps.

Thanks. That's not quite the solution I was looking for. I think it just further reinforces my conclusion that it's impossible to encode what I need in a single reader.

Here is my experience (code follows in the next post).

Case 1:
Separate readers: one for features, one for labels.
Features are hardcoded as a 3x4 grid: 3 features, 4 timesteps.
Labels are hardcoded to be 1.

After passing through the iterator:

Features       Labels
  3x4            1x1
 [f,f,f
  f,f,f   -->    [l]
  f,f,f
  f,f,f]

This makes sense: the whole time sequence (3x4) received a single label (1x1).

Case 2:

Here I'm trying to reproduce the above behavior (3x4) → (1x1) with a single reader.

The features remain in a hardcoded 3x4 matrix, but now the hardcoded label is appended to each row,
making it a 4x4 matrix.

When passed through the sequence iterator the features are (3x4), as before, but now the labels are (1x4).

 SingleSrc    Features       Labels
   4x4         3x4            1x4
[f,f,f,l,    [f,f,f          [l,
 f,f,f,l,     f,f,f   -->     l,
 f,f,f,l      f,f,f           l
 f,f,f,l]     f,f,f]          l]

So my conclusion is that it's impossible to achieve the reduction [C x R] → [1x1] with a single
reader; instead you are limited to a reduction of [C x R] → [1 x R].
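The column split the single-reader constructor performs can be sketched in plain Java (no DL4J here; `ColumnSplit` and the array layout are illustrative, not the library's internals): for every time step, columns before labelIndex go to features and columns from labelIndex onward go to labels, so a sequence of T steps always yields T label rows, never a single one.

```java
import java.util.Arrays;

public class ColumnSplit {
    // Split one sequence (timeSteps x columns) at labelIndex:
    // result[0] = features (timeSteps x labelIndex),
    // result[1] = labels   (timeSteps x (columns - labelIndex)).
    static int[][][] split(int[][] sequence, int labelIndex) {
        int steps = sequence.length;
        int cols = sequence[0].length;
        int[][] features = new int[steps][];
        int[][] labels = new int[steps][];
        for (int t = 0; t < steps; t++) {
            features[t] = Arrays.copyOfRange(sequence[t], 0, labelIndex);
            labels[t] = Arrays.copyOfRange(sequence[t], labelIndex, cols);
        }
        return new int[][][]{features, labels};
    }

    public static void main(String[] args) {
        int[][] seq = {
            {101, 102, 103, 1},
            {104, 105, 106, 1},
            {107, 108, 109, 1},
            {110, 111, 112, 1},
        };
        int[][][] out = split(seq, 3);
        // 4 feature rows of width 3, and 4 label rows of width 1:
        System.out.println(out[0].length + "x" + out[0][0].length); // 4x3
        System.out.println(out[1].length + "x" + out[1][0].length); // 4x1
    }
}
```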

Here is my code (Groovy) to test the above conclusion.

def w = { // convert an int to a Writable
    new org.datavec.api.writable.IntWritable(it)
}

Case 1: Two readers, one for features (3x4), one for labels (1x1)

List<List<List<Writable>>> lb = [[[w(1)]]]
def labelReader = new CollectionSequenceRecordReader(lb)

List<List<List<Writable>>> ft = [[[w(101),w(102),w(103)],
									 [w(104),w(105),w(106)],
									 [w(107),w(108),w(109)],
									 [w(110),w(111),w(112)]]]

def featureReader = new CollectionSequenceRecordReader(ft)

SequenceRecordReaderDataSetIterator iterator = new SequenceRecordReaderDataSetIterator(featureReader, labelReader, 10, -1, true)
def seq = iterator.next()
println "features: "+seq.features.shapeInfoToString()
println "labels: "+seq.labels.shapeInfoToString()

The output on this one is:
features: Rank: 3,Offset: 0
Order: f Shape: [1,3,4], stride: [1,1,3]
labels: Rank: 3,Offset: 0
Order: f Shape: [1,1,1], stride: [1,1,1]
This makes sense: the features are 3x4 and the labels are 1x1.

Case 2: single reader, 4x4, labels at index 3

List<List<List<Writable>>> ft2 = [[[w(101),w(102),w(103), w(1)],
								 [w(104),w(105),w(106), w(1)],
								 [w(107),w(108),w(109), w(1)],
								 [w(110),w(111),w(112), w(1)]]]

def singleReader = new CollectionSequenceRecordReader(ft2)

SequenceRecordReaderDataSetIterator iterator2 = new SequenceRecordReaderDataSetIterator(singleReader, 10, -1, 3, true)

def seq2 = iterator2.next()
println "features: "+seq2.features.shapeInfoToString()
println "labels: "+seq2.labels.shapeInfoToString()

The output here is:
features: Rank: 3,Offset: 0
Order: f Shape: [1,3,4], stride: [1,1,3]
labels: Rank: 3,Offset: 0
Order: f Shape: [1,1,4], stride: [1,1,1]

Here the features are a 3x4 matrix as before, but the labels are now forced into a 1x4 matrix.

@absurd sorry your question isn’t clear to me then. If you need your data in a specific shape, could you describe what your direct goal is? Like “data pipeline that produces ndarrays in this format with labels laid out like xyz” would help.

Maybe I can help with the solution rather than trying to dissect the interpretation of the problem.

From my guess you need the data in a specific layout and are trying to get it there.

Sorry for not quite understanding what you want to do here.

As @agibsonccc said, I think we need an idea for what you are trying to do.

My best guess is that you have some kind of sequence classification task, where you want to consume the entire sequence and produce one output.

The form where features and labels come from the same sequence is meant for the case where you have both a sequence of features and a sequence of labels.

Usually, when you have only a sequence of features and a single label, they are saved separately, e.g. in two files.

The labels you provide are already a 1x4 matrix. There is nothing forced about it.

If you really have your data saved in a format where you just want to use the last label in your sequence as your overall label (which is my best guess at what you are trying to do), then the solution is fairly trivial:

Set LabelLastTimeStepPreProcessor (javadoc) as a preprocessor on your dataset iterator.

It will work with sequences of differing lengths and will always use only the very last label in your sequence as the label.

Aaaaaah. Nice. That’s exactly what I was trying to do; classify a sequence. And that’s exactly the solution I needed.

So basically: add the label to each time step, and add the LabelLastTimeStepPreProcessor to keep only the very last label.

:+1:

I find this extremely difficult as well; the nomenclature varies from source to source, so sometimes it's hard to know how to correctly phrase a question.

I already have the solution above from @treo, but for posterity perhaps I should have phrased it as:

I’m building a data pipeline to classify sequences. There are 9 features, with 30 timesteps. (9x30), and a single label (1x1). The caveat for me is that the features and the label have to come from the same reader/file. Is that possible with DL4J? I already know how to do it, with two separate readers, but what about a single reader?


Err, hate to resurrect this thread, but perhaps I’m not clear on how to use the LabelLastTimeStepPreProcessor.

So, I added the preprocessor to my DataSetIterator and everything looks OK; the INDArray shapes I sampled look as expected.

iterator.setPreProcessor(new LabelLastTimeStepPreProcessor())


features: Rank: 3, DataType: FLOAT, Offset: 0, Order: f, Shape: [5,11,30],  Stride: [1,5,55]
labels: Rank: 2, DataType: FLOAT, Offset: 0, Order: c, Shape: [5,2],  Stride: [2,1]

Above reflects what I imagined this preprocessor does:
5 samples per batch; 11 features × 30 timesteps per sample
1 label per sample → 5 labels per batch; 2 classes per label (one-hot for 1 or 0)

However, when I try to fit/train this data on an RNN, I get the following exception:

Expected rank 3 labels array, got label array with shape [5, 2]

So is there something else that needs to be done on the RNN side to make it work with the preprocessor? Or did I misunderstand where the preprocessor should be applied?

In that case you need to use a regular OutputLayer instead of an RnnOutputLayer.

Thanks for your suggestion. I tried a regular OutputLayer instead of an RnnOutputLayer, but I get a different exception now.

 java.lang.IllegalArgumentException: Labels and preOutput must have equal shapes: got shapes [5, 2] vs [150, 2]
	at org.nd4j.common.base.Preconditions.throwEx(Preconditions.java:636)
	at org.nd4j.linalg.lossfunctions.impl.LossMCXENT.computeGradient(LossMCXENT.java:149)
....

In that case you also need to use either a LastTimeStepLayer or a LastTimeStepVertex right before your output.

OK, not sure if I'm missing some intuition about the code base, but I can't figure out how to add this LastTimeStepLayer to my network configuration. I searched for code examples and didn't find any either.

So from what I can see, the LastTimeStepLayer constructor expects an nn.api.Layer

public LastTimeStepLayer(@NonNull @NotNull org.deeplearning4j.nn.api.Layer underlying){
..
}

However when we are building the nn configuration, using NeuralNetConfiguration.Builder and NeuralNetConfiguration.ListBuilder we are working with nn.conf.layers.Layer not the nn.api.Layer that LastTimeStepLayer expects.

I tried something like…

LSTM lstm1 = new LSTM.Builder()...build()
LSTM lstm2 = new LSTM.Builder()...build()
OutputLayer outputLayer = new OutputLayer.Builder()..build()

new NeuralNetConfiguration.Builder()
......
.list()
.layer(0, lstm1)
.layer(1, new LastTimeStepLayer(lstm2))  //<---This is the part that errors and I cant figure out
.layer(2, outputLayer)

Any thoughts? There is some kind of package peculiarity here that I can't seem to grasp.

ListBuilder expects an nn.conf.layers.Layer, but LastTimeStepLayer is an nn.api.Layer, so I can't add it to the configuration. Similarly, LastTimeStepLayer's constructor expects an nn.api.Layer, but LSTM is an nn.conf.layers.Layer, so I can't even wrap the LSTM in a LastTimeStepLayer in the first place.

Ugh, for some reason there is a naming inconsistency. The thing you’d use in the configuration is called just LastTimeStep.

@agibsonccc maybe we should fix that in the next release?

Perfect, thanks, LastTimeStep worked as expected.
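For anyone landing here later, a minimal sketch of the working configuration (layer sizes, activations, and hyperparameters are placeholders, not values from this thread): the nn.conf.layers.recurrent.LastTimeStep wrapper takes a *configuration* layer, unlike LastTimeStepLayer, which wraps an nn.api.Layer at runtime.

```java
// Sketch only: wrap the last recurrent layer in LastTimeStep so the network
// emits one output per sequence, then use a regular OutputLayer.
// Pair this with LabelLastTimeStepPreProcessor on the iterator so the
// labels are rank 2 as well.
MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
    .list()
    .layer(0, new LSTM.Builder().nIn(11).nOut(20).build())
    .layer(1, new LastTimeStep(new LSTM.Builder().nIn(20).nOut(20).build()))
    .layer(2, new OutputLayer.Builder()
        .nIn(20).nOut(2)
        .activation(Activation.SOFTMAX)
        .lossFunction(LossFunctions.LossFunction.MCXENT)
        .build())
    .build();
```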