Label and features in the same file for time series data (SequenceRecordReaderDataSetIterator)

The SequenceRecordReaderDataSetIterator has a constructor where the features and the labels come from the same reader:

/** Constructor where features and labels come from the SAME RecordReader (i.e., target/label is a column in the
     * same data as the features). Defaults to regression = false - i.e., for classification
     * @param reader SequenceRecordReader with data
     * @param miniBatchSize size of each minibatch
     * @param numPossibleLabels number of labels/classes for classification
     * @param labelIndex index in input of the label index. If in regression mode and numPossibleLabels > 1, labelIndex denotes the
     *                   first index for labels. Everything before that index will be treated as input(s) and
     *                   everything from that index (inclusive) to the end will be treated as output(s)
     */
    public SequenceRecordReaderDataSetIterator(SequenceRecordReader reader, int miniBatchSize, int numPossibleLabels,
                    int labelIndex) {
        this(reader, miniBatchSize, numPossibleLabels, labelIndex, false);
    }

I'm a bit confused about how this data should be shaped.
Does that mean the label should be repeated for every time step? That is, would each step in the time series record have the same value in the label column?
t|features|label
1|FFFFF|L
2|FFFFF|L
3|FFFFF|L

Or is there some other shape that I'm not seeing? Perhaps the label should be formatted as the very last time step?
1FFFFF
2FFFFF
3FFFFF
L

@absurd Take a look at a sample problem:

Here is the associated test-files repo:

Input Data:

Labels:

It basically boils down to one label per step. Hopefully that helps.

Thanks. That's not quite the solution I was looking for. I think it just further reinforces my conclusion that it's impossible to encode what I need in a single reader.

Here is my experience (code follows in the next post).

Case 1:
Separate readers: one for features, one for labels.
Features are hardcoded as a 3x4 grid: 3 features, 4 timesteps.
Labels are hardcoded to be 1.

After passing through the iterator:

Features       Labels
  3x4            1x1
 [f,f,f
  f,f,f   -->    [l]
  f,f,f
  f,f,f]

This makes sense: the whole time sequence (3x4) received a single label (1x1).

Case 2:

Here I'm trying to reproduce the above behavior (3x4) → (1x1) with a single reader.

The features remain in a hardcoded 3x4 matrix, but now the hardcoded label is appended to each row,
making it a 4x4 matrix.

When passed through the sequence iterator the features are (3x4), as before, but now the labels are (1x4).

 SingleSrc    Features       Labels
   4x4         3x4            1x4
[f,f,f,l,    [f,f,f          [l,
 f,f,f,l,     f,f,f   -->     l,
 f,f,f,l      f,f,f           l
 f,f,f,l]     f,f,f]          l]

So my conclusion is that it's impossible to achieve the reduction [C x R] → [1x1] with a single
reader; instead you are limited to a reduction of [C x R] → [1 x R].
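The column split the single-reader constructor performs can be sketched in plain Java (no DL4J here; `ColumnSplit` and the array layout are illustrative, not the library's internals): for every time step, columns before labelIndex go to features and columns from labelIndex onward go to labels, so a sequence of T steps always yields T label rows, never a single one.

```java
import java.util.Arrays;

public class ColumnSplit {
    // Split one sequence (timeSteps x columns) at labelIndex:
    // result[0] = features (timeSteps x labelIndex),
    // result[1] = labels   (timeSteps x (columns - labelIndex)).
    static int[][][] split(int[][] sequence, int labelIndex) {
        int steps = sequence.length;
        int cols = sequence[0].length;
        int[][] features = new int[steps][];
        int[][] labels = new int[steps][];
        for (int t = 0; t < steps; t++) {
            features[t] = Arrays.copyOfRange(sequence[t], 0, labelIndex);
            labels[t] = Arrays.copyOfRange(sequence[t], labelIndex, cols);
        }
        return new int[][][]{features, labels};
    }

    public static void main(String[] args) {
        int[][] seq = {
            {101, 102, 103, 1},
            {104, 105, 106, 1},
            {107, 108, 109, 1},
            {110, 111, 112, 1},
        };
        int[][][] out = split(seq, 3);
        // 4 feature rows of width 3, and 4 label rows of width 1:
        System.out.println(out[0].length + "x" + out[0][0].length); // 4x3
        System.out.println(out[1].length + "x" + out[1][0].length); // 4x1
    }
}
```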

Here is my code (Groovy) to test the above conclusion.

def w = { // convert an int to a Writable
    new org.datavec.api.writable.IntWritable(it)
}

Case 1: Two readers, one for features (3x4), one for labels (1x1)

List<List<List<Writable>>> lb = [[[w(1)]]]
def labelReader = new CollectionSequenceRecordReader(lb)

List<List<List<Writable>>> ft = [[[w(101),w(102),w(103)],
									 [w(104),w(105),w(106)],
									 [w(107),w(108),w(109)],
									 [w(110),w(111),w(112)]]]

def featureReader = new CollectionSequenceRecordReader(ft)

SequenceRecordReaderDataSetIterator iterator = new SequenceRecordReaderDataSetIterator(featureReader, labelReader, 10, -1, true)
def seq = iterator.next()
println "features: "+seq.features.shapeInfoToString()
println "labels: "+seq.labels.shapeInfoToString()

The output on this one is:
features: Rank: 3,Offset: 0
Order: f Shape: [1,3,4], stride: [1,1,3]
labels: Rank: 3,Offset: 0
Order: f Shape: [1,1,1], stride: [1,1,1]
This makes sense: the features are 3x4 and the labels are 1x1.

Case 2: single reader, 4x4, labels at index 3

List<List<List<Writable>>> ft2 = [[[w(101),w(102),w(103), w(1)],
								 [w(104),w(105),w(106), w(1)],
								 [w(107),w(108),w(109), w(1)],
								 [w(110),w(111),w(112), w(1)]]]

def singleReader = new CollectionSequenceRecordReader(ft2)

SequenceRecordReaderDataSetIterator iterator2 = new SequenceRecordReaderDataSetIterator(singleReader, 10, -1, 3, true)

def seq2 = iterator2.next()
println "features: "+seq2.features.shapeInfoToString()
println "labels: "+seq2.labels.shapeInfoToString()

The output here is:
features: Rank: 3,Offset: 0
Order: f Shape: [1,3,4], stride: [1,1,3]
labels: Rank: 3,Offset: 0
Order: f Shape: [1,1,4], stride: [1,1,1]

Here the features are a 3x4 matrix as before, but the labels are now forced into a 1x4 matrix.

@absurd sorry your question isn’t clear to me then. If you need your data in a specific shape, could you describe what your direct goal is? Like “data pipeline that produces ndarrays in this format with labels laid out like xyz” would help.

Maybe I can help with the solution rather than trying to dissect the interpretation of the problem.

From my guess you need the data in a specific layout and are trying to get it there.

Sorry for not quite understanding what you want to do here.

As @agibsonccc said, I think we need an idea for what you are trying to do.

My best guess is that you have some kind of sequence classification task, where you want to consume the entire sequence and produce one output.

The form where features and labels come from the same sequence is meant for the case where you have both a sequence of features and a sequence of labels.

Usually, when you have only a sequence of features and a single label, they are saved separately, e.g. in two files.

The labels you provide are already a 1x4 matrix. There is nothing forced about it.

If you really have your data saved in a format where you just want to use the last label in your sequence as your overall label (which is my best guess at what you are trying to do), then the solution is fairly trivial:

Set LabelLastTimeStepPreProcessor (javadoc) as a preprocessor on your dataset iterator.

It will work with sequences of differing lengths and will always use only the very last label in your sequence as the label.

Aaaaaah. Nice. That’s exactly what I was trying to do; classify a sequence. And that’s exactly the solution I needed.

So basically: add the label to each time step, and add the LabelLastTimeStepPreProcessor to keep only the very last label.

:+1:

I find this extremely difficult as well; the nomenclature varies from source to source, so sometimes it's hard to know how to correctly phrase a question.

I already have the solution above from @treo, but for posterity perhaps I should have phrased it as:

I’m building a data pipeline to classify sequences. There are 9 features, with 30 timesteps. (9x30), and a single label (1x1). The caveat for me is that the features and the label have to come from the same reader/file. Is that possible with DL4J? I already know how to do it, with two separate readers, but what about a single reader?


Err, hate to resurrect this thread, but perhaps I’m not clear on how to use the LabelLastTimeStepPreProcessor.

So, I added the preprocessor to my DataSetIterator and everything looks OK; the INDArray shapes I sampled look as expected.

iterator.setPreProcessor(new LabelLastTimeStepPreProcessor())


features: Rank: 3, DataType: FLOAT, Offset: 0, Order: f, Shape: [5,11,30],  Stride: [1,5,55]
labels: Rank: 2, DataType: FLOAT, Offset: 0, Order: c, Shape: [5,2],  Stride: [2,1]

Above reflects what I imagined this preprocessor does:
5 samples per batch; 11 features × 30 timesteps per sample
1 label per sample → 5 labels per batch; 2 classes per label (one-hot for 1 or 0)

However, when I try to fit/train this data on an RNN, I get the following exception:

Expected rank 3 labels array, got label array with shape [5, 2]

So is there something else that needs to be done on the RNN side to make it work with the preprocessor? Or did I misunderstand where the preprocessor should be applied?

In that case you need to use a regular OutputLayer instead of an RnnOutputLayer.

Thanks for your suggestion. I tried a regular OutputLayer instead of an RnnOutputLayer, but I get a different exception now.

 java.lang.IllegalArgumentException: Labels and preOutput must have equal shapes: got shapes [5, 2] vs [150, 2]
	at org.nd4j.common.base.Preconditions.throwEx(Preconditions.java:636)
	at org.nd4j.linalg.lossfunctions.impl.LossMCXENT.computeGradient(LossMCXENT.java:149)
....

In that case you also need to use either a LastTimeStepLayer or a LastTimeStepVertex right before your output.

OK, not sure if I'm missing some intuition about the code base, but I can't figure out how to add this LastTimeStepLayer to my network configuration. I searched for code examples and didn't find any either.

So from what I can see, the LastTimeStepLayer constructor expects an nn.api.Layer

public LastTimeStepLayer(@NonNull @NotNull org.deeplearning4j.nn.api.Layer underlying){
..
}

However when we are building the nn configuration, using NeuralNetConfiguration.Builder and NeuralNetConfiguration.ListBuilder we are working with nn.conf.layers.Layer not the nn.api.Layer that LastTimeStepLayer expects.

I tried something like…

LSTM lstm1 = new LSTM.Builder()...build()
LSTM lstm2 = new LSTM.Builder()...build()
OutputLayer outputLayer = new OutputLayer.Builder()..build()

new NeuralNetConfiguration.Builder()
......
.list()
.layer(0, lstm1)
.layer(1, new LastTimeStepLayer(lstm2))  //<---This is the part that errors and I cant figure out
.layer(2, outputLayer)

Any thoughts? There is some kind of package peculiarity here that I can't seem to grasp.

ListBuilder expects an nn.conf.layers.Layer, but LastTimeStepLayer is an nn.api.Layer, so I can't add it to the configuration. Similarly, LastTimeStepLayer's constructor expects an nn.api.Layer, but LSTM is an nn.conf.layers.Layer, so I can't even wrap the LSTM in a LastTimeStepLayer in the first place.

Ugh, for some reason there is a naming inconsistency. The thing you’d use in the configuration is called just LastTimeStep.

@agibsonccc maybe we should fix that in the next release?

Perfect, thanks, LastTimeStep worked as expected.
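For anyone landing here later, a minimal sketch of the working configuration (layer sizes, activations, and hyperparameters are placeholders, not values from this thread): the nn.conf.layers.recurrent.LastTimeStep wrapper takes a *configuration* layer, unlike LastTimeStepLayer, which wraps an nn.api.Layer at runtime.

```java
// Sketch only: wrap the last recurrent layer in LastTimeStep so the network
// emits one output per sequence, then use a regular OutputLayer.
// Pair this with LabelLastTimeStepPreProcessor on the iterator so the
// labels are rank 2 as well.
MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
    .list()
    .layer(0, new LSTM.Builder().nIn(11).nOut(20).build())
    .layer(1, new LastTimeStep(new LSTM.Builder().nIn(20).nOut(20).build()))
    .layer(2, new OutputLayer.Builder()
        .nIn(20).nOut(2)
        .activation(Activation.SOFTMAX)
        .lossFunction(LossFunctions.LossFunction.MCXENT)
        .build())
    .build();
```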