25% GPU Usage on 1080 Ti

Hello,

I am having trouble training any model on my NVIDIA 1080 Ti; I get much faster results using the CPU alone.

The GPU utilization sits at around 25% the whole time.

I have tried the seq2seq example from dl4j-examples.

OS: Windows 10 Pro
cuDNN: 8.1.1
CUDA: 11.2.0_460.89

I rolled back to CUDA 10.2 / cuDNN 7.6 and tried batch sizes of 32, 64, 128, and 256, but nothing has helped.

o.n.l.f.Nd4jBackend - Loaded [JCublasBackend] backend
o.n.n.NativeOpsHolder - Number of threads used for linear algebra: 32
o.n.l.a.o.e.DefaultOpExecutioner - Backend used: [CUDA]; OS: [Windows 10]
o.n.l.a.o.e.DefaultOpExecutioner - Cores: [8]; Memory: [8.0GB];
o.n.l.a.o.e.DefaultOpExecutioner - Blas vendor: [CUBLAS]
o.n.l.j.JCublasBackend - ND4J CUDA build version: 11.2.142
o.n.l.j.JCublasBackend - CUDA device 0: [NVIDIA GeForce GTX 1080 Ti]; cc: [6.1]; Total memory: [11810832384]
o.n.l.j.JCublasBackend - Backend build information:
 MSVC: 192930038
STD version: 201703L
CUDA: 11.2.142
DEFAULT_ENGINE: samediff::ENGINE_CUDA
HAVE_FLATBUFFERS
HAVE_CUDNN
o.d.n.g.ComputationGraph - Starting ComputationGraph with WorkspaceModes set to [training: ENABLED; inference: ENABLED], cacheMode set to [NONE]

Thank you for your time :slight_smile:

There can be many reasons why this is slower than the CPU, and RNN-type networks often perform poorly on GPUs.

The main reason is that RNNs are by definition sequential: you can only ever compute one (batched) time step at a time.
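To make that concrete, here is a minimal sketch in plain ND4J (illustrative shapes, and a simple RNN cell rather than a full LSTM) of why the recurrence can't be parallelized across time: each hidden state depends on the previous one, so the GPU only ever gets one small matrix multiply per step.

import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;
import org.nd4j.linalg.ops.transforms.Transforms;

public class RecurrenceSketch {
    public static void main(String[] args) {
        int batch = 64, nIn = 128, nHidden = 256, timeSteps = 100;

        // Illustrative weights; a real LSTM has four gates, this is a plain RNN cell.
        INDArray wIn = Nd4j.rand(nIn, nHidden);
        INDArray wRec = Nd4j.rand(nHidden, nHidden);
        INDArray h = Nd4j.zeros(batch, nHidden);

        // This loop is inherently sequential: step t cannot start
        // before step t-1 has produced h.
        for (int t = 0; t < timeSteps; t++) {
            INDArray xT = Nd4j.rand(batch, nIn); // stand-in for the input at time t
            h = Transforms.tanh(xT.mmul(wIn).add(h.mmul(wRec)));
        }
    }
}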

Is there any chance of reaching 70% or 80% GPU utilization with an LSTM?

I’ve followed most of the GPU optimisation tips here:

Yet nothing has worked out and the usage doesn’t exceed 25%.

I really like DL4J and don't want to switch to another framework, so any help is truly appreciated.

I noticed something weird: the GPU usage stays at 25% even when running multiple examples on the GPU at the same time.

It is hard to tell with the amount of detail you are sharing. What exactly have you tried? Do you have anything we can use to reproduce your case?

Sometimes you get low utilization because your data loading is simply too slow to keep the GPU fed, but I usually see that more with Python frameworks.
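If you suspect the data pipeline, one thing you could try is wrapping your iterator in an async iterator so batches are prepared on a background thread. A sketch, where trainIter, model and numEpochs are placeholders for your own iterator, network and epoch count (and note that DL4J may already prefetch internally, so this might not change much):

import org.deeplearning4j.datasets.iterator.AsyncDataSetIterator;
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;

// trainIter and model are placeholders for your own iterator and network.
DataSetIterator asyncIter = new AsyncDataSetIterator(trainIter, 4); // prefetch up to 4 batches
model.fit(asyncIter, numEpochs);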

Another reason would be that you’re not using the CuDNN variant, so it doesn’t use the more optimized LSTM codepath.

Anyway, that mgubaidullin.github.io page looks like a very old fork of the old DL4J documentation, so anything you read there is likely going to be out of date.

Tried the following examples from dl4j-examples:
AdditionModelWithSeq2Seq - LSTM - 25% utilization
SequenceAnomalyDetection - LSTM - 22% utilization
TrainLotteryModelSeqPrediction - LSTM - 26% utilization
TinyYoloHouseNumberDetection - CNN - 95% utilization

It seems that it’s purely an LSTM issue; the CNN leverages the GPU very well.

I skipped installing the display driver from the NVIDIA CUDA Toolkit installer, since my current driver (511.23) is more recent than the bundled one (460.89), and only installed the CUDA-related components. I'm not sure whether that relates to my issue.

Cuda/CuDNN versions:
cudnn-11.2-windows-x64-v8.1.1.33
cuda_11.2.0_460.89_win10

pom.xml:

...
<nd4j.backend>nd4j-cuda-11.2-platform</nd4j.backend>
...
<dependencies>
    <dependency>
        <groupId>org.nd4j</groupId>
        <artifactId>${nd4j.backend}</artifactId>
        <version>${dl4j-master.version}</version>
    </dependency>

    <dependency>
        <groupId>org.nd4j</groupId>
        <artifactId>nd4j-cuda-11.2</artifactId>
        <version>${dl4j-master.version}</version>
        <classifier>windows-x86_64-cudnn</classifier>
    </dependency>
    ...
</dependencies>

All of those models are tiny, and since RNN-type models are limited to working sequentially, even 25% utilization seems like quite a lot in this case.

That is one of the reasons attention/Transformer models are dominating sequence tasks these days: they process all time steps in parallel, so they can actually use the hardware.
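A rough illustration (plain ND4J, illustrative shapes, a single sequence, and not a complete attention layer): scaled dot-product attention touches every time step with a few large matrix multiplies, which is exactly the kind of work a GPU is good at.

import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;
import org.nd4j.linalg.ops.transforms.Transforms;

public class AttentionSketch {
    public static void main(String[] args) {
        int timeSteps = 100, dModel = 256;

        // Queries, keys and values for ALL time steps at once.
        INDArray q = Nd4j.rand(timeSteps, dModel);
        INDArray k = Nd4j.rand(timeSteps, dModel);
        INDArray v = Nd4j.rand(timeSteps, dModel);

        // One big [timeSteps x timeSteps] matmul instead of a loop over steps.
        INDArray scores = q.mmul(k.transpose()).div(Math.sqrt(dModel));
        INDArray weights = Transforms.softmax(scores); // softmax over each row
        INDArray output = weights.mmul(v);             // again a single large matmul
        System.out.println(java.util.Arrays.toString(output.shape()));
    }
}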

@abed Only the tanh activation function reaches more than 50% utilization for an LSTM on the GPU.
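For reference, a layer configured roughly like this (a sketch; the nIn/nOut values are made up) should be eligible for the cuDNN LSTM helper, since as far as I know cuDNN only implements the standard tanh formulation:

import org.deeplearning4j.nn.conf.layers.LSTM;
import org.nd4j.linalg.activations.Activation;

// Standard tanh LSTM layer; my understanding is that switching the activation
// away from tanh forces the slower non-cuDNN code path.
LSTM lstmLayer = new LSTM.Builder()
        .nIn(128)  // made-up input size
        .nOut(256) // made-up hidden size
        .activation(Activation.TANH)
        .build();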

@treo Do the dl4j-examples contain any implementation of an Attention/Transformer model?

As of right now they don't contain an example implementation, but you should be able to import a TensorFlow- or ONNX-based model rather easily on the current snapshots.
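For example, importing a frozen TensorFlow graph into SameDiff looks roughly like this (a sketch; the file name is made up, and the ONNX import entry points differ and may change between snapshots):

import java.io.File;
import org.nd4j.autodiff.samediff.SameDiff;

// Load a frozen TensorFlow graph (.pb) into SameDiff for inference.
SameDiff sd = SameDiff.importFrozenTF(new File("my_transformer_frozen.pb"));
System.out.println(sd.summary());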