Tuning a simple 2-layer LSTM

Hi, I am training an LSTM model with 2 layers and a batch size of 32 on a data set of 15,000 chat utterances; however, it’s taking almost 4 hours to train. The stranger thing is that it takes the same amount of time on my laptop (an i7 with 4 cores and 8 logical processors) as on a Linux bare-metal server with 256 GB of RAM and 16 cores. Any insights would be helpful here.
I did go through your link on performance tuning and tried some of the suggestions there, but nothing seems to be working.

I’ve split this from the introductory post. In a separate topic it is a whole lot easier to talk about your problem.

That isn’t strange at all. Your description sounds like you have a rather small network and you are using a recurrent network. This means that most of the computations have to be done in sequence. More parallel resources don’t help in that case.
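To make the point about sequential computation concrete, here is a minimal plain-Java sketch. It is not DL4J code; the class name, the method names and the toy scalar recurrence are made up for illustration. Each time step needs the hidden state of the step before it, so the loop over a sequence cannot be spread across cores.

```java
public class RecurrentStepDemo {
    // One step of a toy scalar recurrence: h_t = tanh(w * x_t + u * h_{t-1}).
    static double step(double x, double hPrev, double w, double u) {
        return Math.tanh(w * x + u * hPrev);
    }

    // Runs a whole sequence. Every iteration consumes the hidden state
    // produced by the previous one, so the steps must run one after another.
    static double[] run(double[] xs, double w, double u) {
        double[] hs = new double[xs.length];
        double h = 0.0; // initial hidden state
        for (int t = 0; t < xs.length; t++) {
            h = step(xs[t], h, w, u); // cannot start before step t-1 is done
            hs[t] = h;
        }
        return hs;
    }

    public static void main(String[] args) {
        double[] hs = run(new double[] {1.0, 0.5, -0.25}, 0.8, 0.3);
        for (double h : hs) {
            System.out.println(h);
        }
    }
}
```

The only parallelism left is inside a single step (the matrix products) and across the examples in a batch, and with a small layer and a batch of 32 there is little of either.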

However, in order for us to help you here, we will need to know a bit more about the model you are training. Can you share your training code here?

Hi Treo, unfortunately I cannot paste the exact code here, but I’ll try to give as much detail as I can.

I am building a ComputationGraph with a configuration using:
SGD as the optimization algorithm,
Adam as the updater, with l2 regularization, and
Xavier as the weight initializer.
L1 with an input feature vector for each token in an utterance with 100 hidden layers, followed by an RnnOutputLayer
with sigmoid activation and binary cross-entropy as the loss.

The model is trained with a batch size of 32 over a number of epochs until the stopping criteria are met.

I have 2 CPUs and 32 cores, and I am wondering whether there is anything at all that I could parallelize to train my model faster.

Not having the code makes things a bit harder to quantify.

But from what you’ve given us so far, it looks like you have a very small network and your batch size is also quite small. So there just isn’t that much computation to parallelize during a single batch.

What you can do, however, is try to use ParallelWrapper to train on multiple batches at once. It basically uses your computer the same way that a Spark cluster would be used.

In our examples it is always used in conjunction with a multi-GPU setup, but it doesn’t require one. So if you have multiple CPUs, or if your network is too small to use even a single CPU effectively, then you may get faster training with ParallelWrapper.
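As a rough picture of what "training on multiple batches at once" means, here is a plain-Java sketch of parameter averaging. This is only an illustration under stated assumptions and not ParallelWrapper's actual implementation: the "model" is a single weight, and the class and method names are hypothetical. Each worker starts from the shared weight, trains on its own batch in its own thread, and the resulting copies are then averaged.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParameterAveragingSketch {
    // Toy model: one weight w fitted to targets with the loss 0.5 * (w - t)^2.
    // Performs one SGD pass over a worker's own batch.
    static double trainOnBatch(double w, double[] targets, double learningRate) {
        for (double t : targets) {
            w -= learningRate * (w - t); // the gradient of the loss is (w - t)
        }
        return w;
    }

    // Trains one copy per batch in parallel, then averages the copies.
    static double averagedStep(double w, double[][] batches, double learningRate) {
        ExecutorService pool = Executors.newFixedThreadPool(batches.length);
        try {
            List<Future<Double>> results = new ArrayList<>();
            for (double[] batch : batches) {
                Callable<Double> task = () -> trainOnBatch(w, batch, learningRate);
                results.add(pool.submit(task));
            }
            double sum = 0.0;
            for (Future<Double> result : results) {
                sum += result.get(); // blocks until that worker has finished
            }
            return sum / batches.length;
        } catch (Exception e) {
            throw new IllegalStateException("worker failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        double[][] batches = { {1.0, 1.0}, {3.0, 3.0} };
        System.out.println(averagedStep(0.0, batches, 0.5));
    }
}
```

In this sketch the workers only meet at the averaging step, so each one can keep a core busy on its own batch even when a single batch is too small to do so.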

See ParallelWrapper for the JavaDoc (beta6). And you will have to add another dependency to your project:


But, before you go down this route, you should probably also add a PerformanceListener to your model, to see where most of the time is spent. It could very well be that most of the time is spent in ETL (i.e. loading your data) instead of training the model.

Thanks Treo, I will try that. I have already checked the ETL using the PerformanceListener, and it consistently reports 0 except for the first batch. So I don’t think ETL is the bottleneck.

Another thing: using AVX2 with beta6 did seem to decrease the time by 10% (just looking at the time for a few epochs). I still need to try that on a bigger set for the full training run.

Another question I had is about the OMP_NUM_THREADS parameter: is that something I should be setting explicitly?

Btw, I’m exposing the model building as a web service, which may trigger separate threads that train models on separate sets of data.

If you have an AVX2 or AVX512 capable processor, using the appropriate packages should help speed up the computation too. But they too will only get you so far.

We typically use half of the reported threads of the system, because using hyperthreading usually leads to reduced performance on numerical code. If you are training multiple models at once, reducing that number might be beneficial though. What exactly works on your system is something only you can figure out by trying different values.
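As a starting point for those experiments, the rule of thumb above can be written down in a few lines of plain Java. This is a sketch, not a recommendation from the library: the class and method names are hypothetical, and it assumes that hyperthreading doubles the reported processor count and that concurrent trainings should share the cores evenly.

```java
public class ThreadBudget {
    // Half of the logical processors, assuming hyperthreading doubles the count.
    static int physicalCoreGuess(int logicalProcessors) {
        return Math.max(1, logicalProcessors / 2);
    }

    // Splits that budget evenly between models that train at the same time.
    static int threadsPerModel(int logicalProcessors, int concurrentModels) {
        int models = Math.max(1, concurrentModels);
        return Math.max(1, physicalCoreGuess(logicalProcessors) / models);
    }

    public static void main(String[] args) {
        int logical = Runtime.getRuntime().availableProcessors();
        int concurrentModels = 2; // hypothetical: two training requests at once
        // Prints a candidate value to try, not a guaranteed optimum.
        System.out.println("OMP_NUM_THREADS=" + threadsPerModel(logical, concurrentModels));
    }
}
```

OMP_NUM_THREADS is an environment variable, so the printed value would be set in the environment before the JVM starts and then compared against neighbouring values by timing a few epochs.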

Thanks @treo! Over the last week I have been experimenting with a lot of combinations: training 1 model at a time using AVX2, with and without MKL, and also using ParallelWrapper.
I am seeing some positive results. However, when I use ParallelWrapper I don’t see consistent results. I am guessing the reason could be the averaging and the way it selects the batches of data. Could you please clarify whether there is a way to get reproducible results when the data set does not change?

If I remember correctly, when training with ParallelWrapper, you have something like a cluster running locally. The calculations aren’t run in lock-step, which means that on each re-run each thread can finish sooner or later, and thereby get updates at different points in time.
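A tiny plain-Java sketch shows why the timing matters. It is a toy with a single weight and hypothetical names, not ParallelWrapper's code: because each gradient is computed from the weight as it is at that moment, applying the same two updates in a different order ends at a different value.

```java
public class UpdateOrderDemo {
    // Applies one SGD update per target, in the order given.
    // Loss per target is 0.5 * (w - t)^2, so the gradient is (w - t).
    static double applyInOrder(double w, double[] targets, double learningRate) {
        for (double t : targets) {
            w -= learningRate * (w - t); // uses the weight as it is right now
        }
        return w;
    }

    public static void main(String[] args) {
        // Same starting weight, same two updates, different arrival order.
        System.out.println(applyInOrder(0.0, new double[] {1.0, 3.0}, 0.5)); // 1.75
        System.out.println(applyInOrder(0.0, new double[] {3.0, 1.0}, 0.5)); // 1.25
    }
}
```

So even with an unchanged data set, two runs can diverge as soon as the threads deliver their updates in a different order.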

@raver119 can it be configured to behave deterministically?