DL4J Classification Speed

Greetings!

Does anybody have any idea of what sort of classification speed one can expect using DL4J?

I have a single-hidden-LSTM-layer RNN doing sentiment analysis of tweets (heavily inspired by this example) with the CUDA 10.1 backend (without cuDNN; I’m working on getting that installed, but I have limited privileges on the machine) and two Tesla P100 16GB GPUs. Classifying with net.output(), I get a throughput of about 100 tweets per second. This is way lower than I was hoping for, as I achieved a throughput of 15k tweets per second using a CPU-based implementation of Naive Bayes last semester.

You might ask why I am using net.output() instead of a DataSetIterator: I am using the network in a streaming context and not on a static dataset.

Does anybody have any experience trying to make NNs faster and more scalable? I would greatly appreciate any nudge in the right direction.

Best wishes,
Torsten

It depends entirely on what your use case is.

The worst possible case for inference speed is running a very small NN on a GPU. There is a certain latency involved in moving your input to the GPU, running the calculation, and moving the output back. If you have this one-at-a-time kind of use case, it is usually better to just run it on the CPU directly.

GPUs are faster than CPUs in two cases:

  1. You can batch your requests to run many calculations simultaneously
  2. You have an NN that is so complex it requires a lot of computation even for a single example

As you have a very small neural network, the first option might be interesting for you. If you just want to process lots of tweets, and they aren’t coming in one at a time, then you can pass them all in as a single batch to net.output(). Since you are running inference, the batch size itself will not influence the output for each single example, so you can easily create batches as big as your memory can handle. I’d expect that you should be able to get into the 20k tweets/s range easily.
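For illustration, here is a rough sketch of what that could look like. This is not code from the example: the [1, vectorSize, maxLength] feature shape is an assumption standing in for whatever your vectorization produces.

import java.util.List;

import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;
import org.nd4j.linalg.indexing.INDArrayIndex;
import org.nd4j.linalg.indexing.NDArrayIndex;

// Rough sketch: one forward pass for a whole batch of tweets instead of one
// net.output() call per tweet. Assumes every element of tweetFeatures is a
// [1, vectorSize, maxLength] array, e.g. padded output of your vectorization.
INDArray classifyBatch(MultiLayerNetwork net, List<INDArray> tweetFeatures,
                       int vectorSize, int maxLength) {
    INDArray batch = Nd4j.create(tweetFeatures.size(), vectorSize, maxLength);
    for (int i = 0; i < tweetFeatures.size(); i++) {
        batch.put(new INDArrayIndex[]{NDArrayIndex.point(i), NDArrayIndex.all(), NDArrayIndex.all()},
                tweetFeatures.get(i));
    }
    return net.output(batch);   // one GPU round trip for the whole batch
}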


Thank you very much, that makes a lot of sense.

Just so I understand you clearly: my best shot would be to create some sort of mechanism that bundles tweets into micro-batches of some size (optimally as big as the memory can handle) before the batch is put through net.output()? In the stream I’m working with, tweets arrive one at a time (often at speeds of 5-20k tweets per second), so this would change the processing semantics a bit: tweets would arrive one at a time but be processed in micro-batches. However, I like the idea and it’s definitely worth a shot.

Thanks again for your help.

Take a look at ParallelInference. It does all of that for you. You can tell it to either queue until it is full, or to wait for a specific time to fill up.
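A minimal setup sketch, assuming your trained MultiLayerNetwork is called net; the batch limit and worker count are just placeholders to tune for your two P100s, and singleTweetFeatures stands for the feature array of one tweet:

import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.deeplearning4j.parallelism.ParallelInference;
import org.deeplearning4j.parallelism.inference.InferenceMode;
import org.nd4j.linalg.api.ndarray.INDArray;

// Wrap the trained network so that single-tweet requests are transparently
// collected into batches before they hit the GPU.
ParallelInference pi = new ParallelInference.Builder(net)
        .inferenceMode(InferenceMode.BATCHED) // queue incoming requests and run them as one batch
        .batchLimit(256)                      // max examples per batch (placeholder, tune this)
        .workers(2)                           // e.g. one worker per GPU
        .build();

// Callers keep submitting one tweet at a time; the batching happens inside.
INDArray prediction = pi.output(singleTweetFeatures);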


Thank you so much for your help. I am trying some things out now; would you mind giving me some pointers on how to proceed with ParallelInference? I’ve taken a look at this example, but I am not sure exactly how to run the inference batch-style.

Right now I am reading tweets from a CSV file, converting them to word2vec features (INDArray) of size 300, and putting them through pi.output(). To process them batch-style, would I have to batch my feature vectors together into an INDArray[] and put that through pi.output()? If that’s the case, I would assume the output comes back as an INDArray[] that I would have to loop through to get the predictions for the individual tweets (keeping track of the indices going in to get the tweet-sentiment pairs). Is this correct?

When you are doing it like that, you don’t have to go through ParallelInference, as you can just read all of them at once and have just a single output request.

I guess you are converting to w2v features manually. Have you taken a look at your application with a profiler to see what is actually slow? Vectorizing the data inefficiently is often the main bottleneck in these kinds of setups, especially when you are using a manual w2v conversion.


When you are doing it like that, you don’t have to go through ParallelInference, as you can just read all of them at once and have just a single output request.

Yes, I am aware; I’m just simulating the streaming conditions by reading/processing them one at a time or, for now, appending them to an array of a certain size and processing them batch-wise as you suggested. The final use case for the NN will be a User Defined Function for AsterixDB that will process tweets as they are streamed into the system, with the Twitter Streaming API as a means of continual data ingestion.

Have you taken a look at your application with a profiler to see what is actually slow? Vectorizing the data inefficiently is often the main bottleneck in these kinds of setups.

You are absolutely right. I did some testing processing arrays of 1000, 5000, and 10’000 tweets, and couldn’t achieve a throughput higher than 1000 tweets per second for any of them. It turns out most of the processing time was going to vectorization.

I am thinking that instead of using these slow word vectors, an idea might be to use home-made char vectors. Given that a tweet has a maximum length of 280 characters, 280 is an obvious choice for the vector length. Each character could be mapped to a corresponding integer, and tweets shorter than 280 could be padded out. The vectorization process would then be a series of HashMap lookups to get the integers corresponding to the characters, i.e. a series of O(1) lookups, which should be pretty fast.
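Roughly something like this (the alphabet below is just an example; in practice I’d build it from the training data):

import java.util.HashMap;
import java.util.Map;

public class CharVectorizer {

    static final int MAX_TWEET_LENGTH = 280;

    // Example alphabet; index 0 is reserved for padding/unknown characters
    static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789 .,!?@#:/";
    static final Map<Character, Integer> CHAR_TO_INDEX = new HashMap<>();
    static {
        for (int i = 0; i < ALPHABET.length(); i++) {
            CHAR_TO_INDEX.put(ALPHABET.charAt(i), i + 1);
        }
    }

    // Map each character to its integer index, padding short tweets with 0
    static double[] vectorizeTweet(String tweet) {
        double[] vector = new double[MAX_TWEET_LENGTH];   // initialized to 0.0 = padding
        String lower = tweet.toLowerCase();
        for (int i = 0; i < lower.length() && i < MAX_TWEET_LENGTH; i++) {
            vector[i] = CHAR_TO_INDEX.getOrDefault(lower.charAt(i), 0);
        }
        return vector;
    }
}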

@treo does this seem like a sound way to proceed? As long as the accuracy is above 70% or so, I only care about throughput for this project, that is, maximizing the number of tweets processed per second in a streaming context.

Are you training or only classifying the results?

Only classifying. The training happens offline; the model is then bundled into a User Defined Function and deployed to a BDMS to classify tweets that are being continually ingested into the system.

Understood. Are you using a vector model?

How exactly are you running your vectorization? Usually our W2V is quite fast with its lookups.

That can work with something like a CharCNN-based setup, but a setup with pretrained word vectors is usually easier to work with.

I think I am using the exact same code as in the example I am drawing inspiration from.

public INDArray loadFeaturesFromString(String reviewContents, int maxLength){
    // Tokenize the text and keep only tokens that exist in the word2vec vocabulary
    List<String> tokens = this.tokenizerFactory.create(reviewContents).getTokens();
    List<String> tokensFiltered = new ArrayList<>();
    for(String t : tokens ){
        if(wordVectors.hasWord(t)) tokensFiltered.add(t);
    }
    int outputLength = Math.min(maxLength,tokensFiltered.size());

    // Features have shape [1, vectorSize, sequenceLength]
    INDArray features = Nd4j.create(1, vectorSize, outputLength);

    int count = 0;
    for( int j=0; j<tokensFiltered.size() && count<maxLength; j++ ){
        String token = tokensFiltered.get(j);
        INDArray vector = wordVectors.getWordVectorMatrix(token);
        if(vector == null){
            continue;   //Word not in word vectors
        }
        // Put the word vector at time step j of the single example in this minibatch
        features.put(new INDArrayIndex[]{NDArrayIndex.point(0), NDArrayIndex.all(), NDArrayIndex.point(j)}, vector);
        count++;
    }
    return features;
}

What you might want to try instead is to set your embedding matrix as the weights of an embedding layer, and then feed it with word indexes.

That way you don’t create unnecessarily large inputs and the lookup should be a lot faster.


Sounds good.

Right now the model is loaded by

WordVectors wordVectors = WordVectorSerializer.loadStaticModel(new File(WORD_VECTORS_PATH));

And the Network Config looks like

MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
    .seed(seed)
    .updater(new Adam(5e-3))
    .l2(1e-5)
    .weightInit(WeightInit.XAVIER)
    .gradientNormalization(GradientNormalization.ClipElementWiseAbsoluteValue)
    .gradientNormalizationThreshold(1.0)
    .list()
    .layer(new LSTM.Builder().nIn(vectorSize).nOut(256)
        .activation(Activation.TANH).build())
    .layer(new RnnOutputLayer.Builder().activation(Activation.SOFTMAX)
        .lossFunction(LossFunctions.LossFunction.MCXENT).nIn(256).nOut(2).build())
    .build();

So I guess I would be adding an embedding layer before the LSTM layer, but how would I go about setting the weights?

Using wordVectors.lookupTable().getWeights() you can get the actual weights, and wordVectors.vocab() and vocab.indexOf(word) will give you a way to get the word index.

If you want to use static weights for your case, you should also set the learning rate for the embedding layer to 0, and after initialization you can set the weights for it with model.setParam("1_W", weights). The name of the correct parameter view might be slightly different; you can look it up in the keys of model.paramTable().
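Putting the pieces together, a rough sketch of what that could look like. The layer sizes are taken from your config above, the rest of your builder settings are omitted for brevity, and whether the embedding weights end up under "0_W" or "1_W" should be checked against net.paramTable() as mentioned:

import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.deeplearning4j.nn.conf.layers.EmbeddingSequenceLayer;
import org.deeplearning4j.nn.conf.layers.LSTM;
import org.deeplearning4j.nn.conf.layers.RnnOutputLayer;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.nd4j.linalg.activations.Activation;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.learning.config.Adam;
import org.nd4j.linalg.learning.config.NoOp;
import org.nd4j.linalg.lossfunctions.LossFunctions;

int vocabSize = wordVectors.vocab().numWords();

MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
    .seed(seed)
    .updater(new Adam(5e-3))
    .list()
    // Embedding layer takes word indexes and looks up the corresponding vectors
    .layer(new EmbeddingSequenceLayer.Builder()
        .nIn(vocabSize).nOut(vectorSize)
        .updater(new NoOp())        // keep the pretrained embeddings fixed
        .build())
    .layer(new LSTM.Builder().nIn(vectorSize).nOut(256)
        .activation(Activation.TANH).build())
    .layer(new RnnOutputLayer.Builder().activation(Activation.SOFTMAX)
        .lossFunction(LossFunctions.LossFunction.MCXENT).nIn(256).nOut(2).build())
    .build();

MultiLayerNetwork net = new MultiLayerNetwork(conf);
net.init();

// Copy the pretrained word vectors into the embedding layer; the embedding
// layer is layer 0 in this config, so its weights are "0_W" here.
INDArray embeddingWeights = wordVectors.lookupTable().getWeights();
net.setParam("0_W", embeddingWeights);

// Inference input is then a [batchSize, sequenceLength] array of word indexes,
// obtained via wordVectors.vocab().indexOf(word).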


I think you can improve the speed by using a custom implementation with ChronicleMap to load the vectors.

Let’s not introduce new moving parts until the actual bottleneck is identified.

Unless @torstenbm finds that getting values from a map is the actual bottleneck, adding a different map implementation isn’t going to help at all.


Hey @treo, I tried the char-based approach, with great success.

Because a tweet is at most 280 characters, I convert each tweet string to a double[280] and then create batches of 10k tweets in a double[10000][280] tweetBatch.

Then I run them through my network using

INDArray features = Nd4j.create(tweetBatch);
INDArray networkOutput = pi.output(features);

and with this I have achieved throughputs of up to 230k tweets per second, which is a huge improvement!

The only thing I haven’t been able to do yet is retrieve the sentiment for each tweet from the networkOutput object. Would you mind helping me construct a loop to do that? My network has an input layer of 280 nodes and an output layer of 2 nodes.

Great to hear!

You are likely looking for the argMax method. It gives you the index of the label with the highest probability.
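For example, something along these lines, assuming networkOutput ends up with shape [batchSize, 2], i.e. one row of class probabilities per tweet:

import org.nd4j.linalg.api.ndarray.INDArray;

// Turn the batched output into one predicted class per tweet
INDArray predictedClasses = networkOutput.argMax(1);       // shape [batchSize], values 0 or 1

for (int i = 0; i < predictedClasses.length(); i++) {
    int label = predictedClasses.getInt(i);                // e.g. 0 = negative, 1 = positive
    double probability = networkOutput.getDouble(i, label);
    System.out.println("tweet " + i + " -> class " + label + " (p=" + probability + ")");
}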

Am I though? I am classifying 10’000 tweets at once with a 10000x280 INDArray. At the output there is one node activating for positive tweets and one node for negative tweets, and I am looking to get the activation values associated with each of these nodes for each tweet, in order to see whether the model predicts the tweet to be more positive or more negative.