25% GPU Usage on 1080 Ti

Hello,

I am having trouble training any model on my NVIDIA 1080 Ti; I get much faster results using the CPU alone.

The GPU utilization sits at around 25% the whole time.

I have tried the seq2seq example from dl4j-examples.

OS: Windows 10 Pro
cuDNN: 8.1.1
CUDA: 11.2.0_460.89

I rolled back to CUDA 10.2 / cuDNN 7.6 and tried batch sizes of 32, 64, 128, and 256, but nothing has helped.

o.n.l.f.Nd4jBackend - Loaded [JCublasBackend] backend
o.n.n.NativeOpsHolder - Number of threads used for linear algebra: 32
o.n.l.a.o.e.DefaultOpExecutioner - Backend used: [CUDA]; OS: [Windows 10]
o.n.l.a.o.e.DefaultOpExecutioner - Cores: [8]; Memory: [8.0GB];
o.n.l.a.o.e.DefaultOpExecutioner - Blas vendor: [CUBLAS]
o.n.l.j.JCublasBackend - ND4J CUDA build version: 11.2.142
o.n.l.j.JCublasBackend - CUDA device 0: [NVIDIA GeForce GTX 1080 Ti]; cc: [6.1]; Total memory: [11810832384]
o.n.l.j.JCublasBackend - Backend build information:
 MSVC: 192930038
STD version: 201703L
CUDA: 11.2.142
DEFAULT_ENGINE: samediff::ENGINE_CUDA
HAVE_FLATBUFFERS
HAVE_CUDNN
o.d.n.g.ComputationGraph - Starting ComputationGraph with WorkspaceModes set to [training: ENABLED; inference: ENABLED], cacheMode set to [NONE]

Thank you for your time :slight_smile:

There can be many reasons why this is slower than the CPU, and RNN-type networks often perform poorly on GPUs.

The main reason is that RNNs are by definition sequential: you can only ever compute one (batched) time step at a time.
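To make that concrete, here is a minimal sketch in plain ND4J (illustrative shapes, and a simple RNN cell rather than a full LSTM) of why the recurrence can't be parallelized across time: each hidden state depends on the previous one, so the GPU only ever gets one small matrix multiply per step.

import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;
import org.nd4j.linalg.ops.transforms.Transforms;

public class RecurrenceSketch {
    public static void main(String[] args) {
        int batch = 64, nIn = 128, nHidden = 256, timeSteps = 100;

        // Illustrative weights; a real LSTM has four gates, this is a plain RNN cell.
        INDArray wIn = Nd4j.rand(nIn, nHidden);
        INDArray wRec = Nd4j.rand(nHidden, nHidden);
        INDArray h = Nd4j.zeros(batch, nHidden);

        // This loop is inherently sequential: step t cannot start
        // before step t-1 has produced h.
        for (int t = 0; t < timeSteps; t++) {
            INDArray xT = Nd4j.rand(batch, nIn); // stand-in for the input at time t
            h = Transforms.tanh(xT.mmul(wIn).add(h.mmul(wRec)));
        }
    }
}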

Is there any chance of reaching 70% or 80% GPU utilization with an LSTM?

I’ve followed most of the GPU optimisation tips here:

Yet nothing has worked out and the usage doesn’t exceed 25%.

I really like DL4J and don't want to switch to another framework, so any help is truly appreciated.

I noticed something weird: the GPU usage stays at 25% even when running multiple examples on the GPU at the same time.

It is hard to tell with the amount of detail you are sharing. What exactly have you tried? Do you have anything we can use to reproduce your case?

Sometimes you get low utilization because your data loading is simply too slow to keep the GPU fed, but I usually see that more with Python frameworks.
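If you suspect the data pipeline, one thing you could try is wrapping your iterator in an async iterator so batches are prepared on a background thread. A sketch, where trainIter, model and numEpochs are placeholders for your own iterator, network and epoch count (and note that DL4J may already prefetch internally, so this might not change much):

import org.deeplearning4j.datasets.iterator.AsyncDataSetIterator;
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;

// trainIter and model are placeholders for your own iterator and network.
DataSetIterator asyncIter = new AsyncDataSetIterator(trainIter, 4); // prefetch up to 4 batches
model.fit(asyncIter, numEpochs);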

Another reason would be that you’re not using the CuDNN variant, so it doesn’t use the more optimized LSTM codepath.

Anyway, that mgubaidullin.github.io page looks like a very old fork of the old DL4J documentation, so anything you read there is likely going to be out of date.

Tried the following examples from dl4j-examples:
AdditionModelWithSeq2Seq - LSTM - 25% utilization
SequenceAnomalyDetection - LSTM - 22% utilization
TrainLotteryModelSeqPrediction - LSTM - 26% utilization
TinyYoloHouseNumberDetection - CNN - 95% utilization

It seems that it’s purely an LSTM issue; the CNN leverages the GPU very well.

I skipped installing the display driver from the NVIDIA CUDA Toolkit installer, since my current driver (511.23) is more recent than the bundled one (460.89), and only installed the CUDA-related components. I'm not sure whether that relates to my issue.

Cuda/CuDNN versions:
cudnn-11.2-windows-x64-v8.1.1.33
cuda_11.2.0_460.89_win10

pom.xml:

...
<nd4j.backend>nd4j-cuda-11.2-platform</nd4j.backend>
...
<dependencies>
    <dependency>
        <groupId>org.nd4j</groupId>
        <artifactId>${nd4j.backend}</artifactId>
        <version>${dl4j-master.version}</version>
    </dependency>

    <dependency>
        <groupId>org.nd4j</groupId>
        <artifactId>nd4j-cuda-11.2</artifactId>
        <version>${dl4j-master.version}</version>
        <classifier>windows-x86_64-cudnn</classifier>
    </dependency>
    ...
</dependencies>

All of those models are tiny, and since RNN-type models are limited to working sequentially, even 25% utilization seems like quite a lot in this case.

That is one of the reasons attention/Transformer models are dominating sequence tasks these days: they process all time steps in parallel, so they can actually use the hardware.
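A rough illustration (plain ND4J, illustrative shapes, a single sequence, and not a complete attention layer): scaled dot-product attention touches every time step with a few large matrix multiplies, which is exactly the kind of work a GPU is good at.

import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;
import org.nd4j.linalg.ops.transforms.Transforms;

public class AttentionSketch {
    public static void main(String[] args) {
        int timeSteps = 100, dModel = 256;

        // Queries, keys and values for ALL time steps at once.
        INDArray q = Nd4j.rand(timeSteps, dModel);
        INDArray k = Nd4j.rand(timeSteps, dModel);
        INDArray v = Nd4j.rand(timeSteps, dModel);

        // One big [timeSteps x timeSteps] matmul instead of a loop over steps.
        INDArray scores = q.mmul(k.transpose()).div(Math.sqrt(dModel));
        INDArray weights = Transforms.softmax(scores); // softmax over each row
        INDArray output = weights.mmul(v);             // again a single large matmul
        System.out.println(java.util.Arrays.toString(output.shape()));
    }
}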

@abed Only the tanh activation function reaches more than 50% utilization for an LSTM on the GPU.
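For reference, a layer configured roughly like this (a sketch; the nIn/nOut values are made up) should be eligible for the cuDNN LSTM helper, since as far as I know cuDNN only implements the standard tanh formulation:

import org.deeplearning4j.nn.conf.layers.LSTM;
import org.nd4j.linalg.activations.Activation;

// Standard tanh LSTM layer; my understanding is that switching the activation
// away from tanh forces the slower non-cuDNN code path.
LSTM lstmLayer = new LSTM.Builder()
        .nIn(128)  // made-up input size
        .nOut(256) // made-up hidden size
        .activation(Activation.TANH)
        .build();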

@treo Do the dl4j-examples contain any implementation of an Attention/Transformer model?

As of right now they don't contain an example implementation, but you should be able to import a TensorFlow- or ONNX-based model rather easily on the current snapshots.
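For example, importing a frozen TensorFlow graph into SameDiff looks roughly like this (a sketch; the file name is made up, and the ONNX import entry points differ and may change between snapshots):

import java.io.File;
import org.nd4j.autodiff.samediff.SameDiff;

// Load a frozen TensorFlow graph (.pb) into SameDiff for inference.
SameDiff sd = SameDiff.importFrozenTF(new File("my_transformer_frozen.pb"));
System.out.println(sd.summary());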