Question regarding LSTM training and data format

Good afternoon,

I wanted to ask if I am on the right path for training my LSTM RNN. I have been reading this forum, had a look at the git repository and all the documentation, and there seems to be some outdated information. I could not find a single, coherent LSTM tutorial, or the ones I found did not have all the information needed. I would therefore appreciate any help.

I have read the training notes, the advice for choosing parameters, and the expected input format for the RNN on the official website. However, my network is not performing as expected, which might be a matter of tuning the parameters. Regardless, I wanted reassurance that my input data is in the right format and that the way I am training the network is correct.

LSTM RNN task: I have a single time series of numbers (x1) and a second time series which is the expected output (x2). Basically, I have two correlated variables (x1 and x2), and given the first one (x1) I want to predict the second one (x2) at the next time step. Following the tutorial on the official website, I have formatted my data as [1 - as I have only one time series][1 - as I have the first variable (x1) as input][500 - the length of my time series, basically 500 seconds]. This is my features data format; is it correct? My label data format is [1][1 - storing the second variable][500]. I combine both into a DataSet. Is that correct?
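For clarity, this is roughly how I build the arrays (a simplified sketch; x1 and x2 are my two series as double arrays, and the variable names are just for illustration):

```java
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.dataset.DataSet;
import org.nd4j.linalg.factory.Nd4j;

// features and labels both have shape [miniBatch=1, features=1, timeSteps=500]
INDArray features = Nd4j.create(new int[]{1, 1, 500});
INDArray labels = Nd4j.create(new int[]{1, 1, 500});
for (int t = 0; t < 500; t++) {
    features.putScalar(new int[]{0, 0, t}, x1[t]); // input series
    labels.putScalar(new int[]{0, 0, t}, x2[t]);   // expected output series
}
DataSet dataSet = new DataSet(features, labels);
```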

Lastly, I wanted to enquire about the training. I have seen that the getFeaturesMatrix() method on the data iterator is deprecated and has since been removed. As a result, I have a for loop over a specified number of epochs in which I simply call the fit() method, passing the whole data set, and then clear the previous RNN state.
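In code, the loop is roughly this (simplified; net is my MultiLayerNetwork and dataSet is the DataSet from above):

```java
for (int epoch = 0; epoch < numEpochs; epoch++) {
    net.fit(dataSet);            // fit on the whole 500-step DataSet
    net.rnnClearPreviousState(); // clear the stored RNN state before the next epoch
}
```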

As my data is all in the same units, I do not normalise it, so I am not expecting that to be the cause of the problem.

Overall, I just wanted to ask if I am formatting the data correctly, as I suspect that I am somehow using only one sample rather than the full time series. Could that be the case?

Best Regards
G.

In any case, training works in mini-batches.

That means that the shape of the data expected for training always has the mini-batch size as its first dimension. So you can have the following shapes, depending on the problem you are solving and the model type you are using:

  • MLP: [batch size, feature size]
  • RNN: [batch size, feature size, timestep count]
  • CNN: [batch size, channels, height, width] or [batch size, height, width, channels] (depending on configuration)

So your data format looks to be correct in principle.

However, it looks like you are trying to skip a few steps when it comes to data preparation. From your question, it looks like you are creating the DataSet objects manually and are not using a DataSetIterator.

When using a DataSetIterator you can train for as long as you want just by calling model.fit(iterator, epochCount).

If your data has some weird format, or you just don’t want to use any of the existing SequenceRecordReaders to read your data from its storage, you can still use the CollectionSequenceRecordReader and provide it with a List<List<List<DoubleWritable>>>. I know that type is ugly, but it is really simple to understand: it is a list of example sequences, where each sequence is a list of time steps, and each time step is a record (= a list of double values wrapped in DoubleWritable).

Then you create a SequenceRecordReaderDataSetIterator from that record reader and tell it that you are running a regression and that the label is in the second column (or wherever you have put it).
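A minimal sketch of that pipeline, assuming x1 and x2 are your two series as double arrays (the variable names here are just for illustration):

```java
import org.datavec.api.records.reader.impl.collection.CollectionSequenceRecordReader;
import org.datavec.api.writable.DoubleWritable;
import org.datavec.api.writable.Writable;
import org.deeplearning4j.datasets.datavec.SequenceRecordReaderDataSetIterator;
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// one example sequence: a list of time steps, each time step = [x1, x2]
List<List<Writable>> sequence = new ArrayList<>();
for (int t = 0; t < x1.length; t++) {
    sequence.add(Arrays.<Writable>asList(new DoubleWritable(x1[t]), new DoubleWritable(x2[t])));
}
List<List<List<Writable>>> sequences = new ArrayList<>();
sequences.add(sequence);

CollectionSequenceRecordReader reader = new CollectionSequenceRecordReader(sequences);

// regression = true, label in column index 1 (the x2 column);
// numPossibleLabels is ignored for regression, hence -1
DataSetIterator iterator = new SequenceRecordReaderDataSetIterator(
        reader, /* miniBatchSize */ 1, /* numPossibleLabels */ -1, /* labelIndex */ 1, /* regression */ true);

net.fit(iterator, epochCount);
```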

That way you can make use of all the comforts that DL4J provides around vectorization while still keeping your data in memory.

Why go through all of that effort if you effectively have just a single example? Well, one of the reasons is that you may not want to have a single example after all.

RNNs, and even LSTMs, have a rather short “memory” of the previous steps, so feeding them the full sequence can be useless. Splitting the sequence into many windows may be a better way to approach the problem. That gives you more parallelization during training (recurrent networks are inherently sequential otherwise) and more updates to your model weights in each epoch.
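For example, something like this turns the single 500-step sequence into many shorter examples (the window length of 50 is just an illustrative choice):

```java
import org.datavec.api.writable.DoubleWritable;
import org.datavec.api.writable.Writable;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// split the single long series into fixed-length windows;
// each window becomes its own example sequence
int windowLength = 50; // illustrative; tune to the "memory" span your problem needs
List<List<List<Writable>>> sequences = new ArrayList<>();
for (int start = 0; start + windowLength <= x1.length; start += windowLength) {
    List<List<Writable>> window = new ArrayList<>();
    for (int t = start; t < start + windowLength; t++) {
        window.add(Arrays.<Writable>asList(new DoubleWritable(x1[t]), new DoubleWritable(x2[t])));
    }
    sequences.add(window);
}
// 500 steps with windowLength = 50 -> 10 examples of 50 steps instead of 1 example of 500
```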

Unless your data is already in the -1 to 1 range, or has zero mean and unit variance, you likely need to normalize it anyway.

Remember that most activation functions for neural networks have an active range between -1 and 1 and saturate quickly outside it (it is mostly ReLU variants that can work outside that range).
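DL4J ships normalizers for exactly this. A minimal sketch with NormalizerStandardize (zero mean, unit variance) applied to the iterator from the earlier sketch:

```java
import org.nd4j.linalg.dataset.api.preprocessor.NormalizerStandardize;

NormalizerStandardize normalizer = new NormalizerStandardize();
normalizer.fitLabel(true);            // for regression, normalize the labels too
normalizer.fit(iterator);             // first pass: collect mean and std statistics
iterator.reset();                     // fit() consumed the iterator, so rewind it
iterator.setPreProcessor(normalizer); // every batch is now normalized on the fly
```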

Without actually knowing what your data is and how you’ve set up your model architecture, it is hard to tell you what is going wrong. Many toy examples that people come up with to “simplify learning” can be really hard to work with, because tuning them correctly requires good knowledge of how the math actually works.

Another common problem is that people blindly copy learning rates, update algorithms and regularization settings, and then the network doesn’t learn at all, or learns so slowly that it looks like it isn’t learning.

Thank you for such an elaborate answer, massively appreciated! I have taken your advice and changed the data storage type. For other people reading this thread: those tests were great examples.

Regarding normalization, I have again taken your advice and used the normalizers from the DL4J library.

Regarding the learning rates etc., I think I have solid grounds for my choices (I have some experience with this and have done my research), so it is not a case of Ctrl+C. After the mentioned changes the network is performing somewhat sensibly, so thank you for that. However, when I visualise the learning process in the score vs. iteration chart, the value of the loss function is quite high. Could it be caused by the normalization? Sorry, but here I am clueless, so any intuitive advice would be great!

Regardless, thanks for the confirmation and advice!

There are many reasons why that may happen. Each loss function behaves differently, and it is hard to give you any good advice without knowing what the curve looks like and what your hyperparameters are.