Significant Performance Drop in the `nn.fit()` Method After Upgrading to an NVIDIA 3060 and CUDA 11.6

I had been using a GTX 1660 Ti with CUDA 10.2 and DL4J 1.0.0-beta7.
We have now switched to an RTX 3060, CUDA 11.6, and DL4J 1.0.0-M2.1.

Notes:

  • The CUDA version was raised because the new GPU is not compatible with CUDA 10.2.
  • The DL4J version was raised because 1.0.0-beta7 does not support the CUDA 11 series.
  • In other words, this is the only configuration that can be tested in the current environment.
  • The driver is up to date.
  • CUDA and cuDNN are pulled in via Maven.

Even though the GPU has become much more powerful, the training speed has dropped to about a third of what it was before.

GPU usage shows 100% in the Windows Task Manager, but GPU-Z reports a GPU Load of only about 20%.

I measured the execution time of the network.fit() method: it takes 2.5-4 seconds even for a small mini-batch on a very simple network.

The slowdown is entirely due to the slow execution of the fit() method, but the underlying cause is unknown: even with a very simple test, execution is still slow.
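For reference, per-call fit() latency can be measured with a minimal harness like the following (plain Java; the Runnable stands in for network.fit(miniBatch), which is not included here, so no DL4J classes are needed):

```java
// Minimal timing harness for measuring per-call latency of a training step.
// The Runnable stands in for network.fit(miniBatch); no DL4J classes are used.
public class FitTimer {
    // Returns the wall-clock duration of one call, in milliseconds.
    public static long timeMillis(Runnable step) {
        long start = System.nanoTime();
        step.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        long ms = timeMillis(() -> { /* network.fit(miniBatch) would go here */ });
        System.out.println("fit() took " + ms + " ms");
    }
}
```

Averaging over several calls after a warm-up iteration gives more stable numbers, since the first fit() call also includes one-time CUDA context setup.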

DL4J recognizes CUDA, and I ran Nd4j.create() repeatedly to confirm that GPU memory usage increases.

What could be the cause?

Since I tested on a simple network, 2.5-4 seconds per call is inconceivable.
There must be a problem elsewhere.

On a very simple network, it is also worth noting that a fit() call takes the same amount of time for mini-batch sizes of 32 and 256.
A mini-batch of 256 is naturally more computationally intensive, yet for some reason the fit() call takes almost the same amount of time.

When I trained the same network on both GPUs, the 1660 Ti was about 10 times faster than the 3060: a fit() call took about 500 milliseconds on the 1660 Ti, versus about 5 seconds on the 3060.

@Nasyumaro let me take a look at this in the next version. A new release is pending. I'm spending a lot of time on CUDA QA now. In the meantime, could you file a reproducer on GitHub? Thanks!


Aside (no need to read):

<dependency>
        <groupId>org.bytedeco</groupId>
        <artifactId>cuda-platform-redist</artifactId>
        <version>11.4-8.2-1.5.6</version>
</dependency>

I tried using the dependency above because of the concern below, but the problem was not resolved.

The logs confirm that the CUDA and cuDNN versions were actually switched.

Concerns that turned out not to be a problem (no need to read):

Here, CUDA 11.6 and cuDNN 8.3 are defined as a combination.

However, according to this site, there is no cuDNN 8.3 build corresponding to CUDA 11.6.

@Nasyumaro it does here: Maven Central Repository Search. This is literally what was released. Looking at our build file, this is the cuDNN we use:
https://developer.download.nvidia.com/compute/redist/cudnn/v8.3.2/local_installers/11.5/cudnn-linux-x86_64-8.3.2.44_cuda11.5-archive.tar.xz

These are the builds that it was released with, see the build script:


I would like to provide the code, but the program is so large that I cannot extract a fragment that can be tested… Let me think about it a little more…

With the exact same program and network, this phenomenon occurs just by changing the GPU and CUDA version.
(Training speed drops to 1/10, even though the GPU was upgraded.)

For now, the test network is as follows:

  • The total number of input parameters is about 60.
  • The input has 120 time steps.
  • The output size is 2.
val conf: ComputationGraphConfiguration = NeuralNetConfiguration.Builder()
    .seed(LearningMain.rand.nextLong())
    .weightInit(WeightInit.XAVIER)
    .miniBatch(true)
    .updater(Adam())
    .optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)
    .graphBuilder()
    .addInputs(inputFunc.map { it.name })
    .addLayer(
        "${fc}_LSTM1", LSTM.Builder()
            .nIn(inputFunc3D.sumBy { it.numColumn }).nOut(128)
            .activation(Activation.TANH)
            .build(),
        *inputFunc3D.map { it.name }.toTypedArray()
    )
    .addLayer(
        "${fc}_LSTM2", LSTM.Builder()
            .nIn(128).nOut(128)
            .activation(Activation.TANH)
            .build(),
        "${fc}_LSTM1"
    )
    .addLayer(
        "LTS", LastTimeStep(LSTM.Builder()
            .activation(Activation.TANH)
            .nIn(128).nOut(128)
            .build()),
        "${fc}_LSTM2"
    )
    .addLayer(
        "FC", DenseLayer.Builder()
            .activation(Activation.TANH)
            .nIn(128 + 1).nOut(128)
            .build(),
        "LTS", "centerWidth2D"
    )
    .addLayer(
        "MCXENT", OutputLayer.Builder(LossFunctions.LossFunction.MCXENT)
            .activation(Activation.SOFTMAX)
            .nIn(128).nOut(outputNames[0].size)
            .build(),
        "FC"
    )
    .setOutputs("MCXENT")
    .build()

val nn = ComputationGraph(conf)
nn.init()

If we roughly double the number of layers in the network, the fit() call takes twice as long.

The input and output sizes have not changed; only the middle layers grew, so the amount of data that has to be transferred to the GPU each epoch has not increased.

Of course, fit() calls get slower as the network becomes more complex.
However, since the network is small, I don't think the matrix computations themselves could take this long.

From this, I suspect that some overhead tied to the number of layers (CPU<->GPU round trips that occur for each layer within an epoch) may be the cause.


I measured the execution time again with the same network, changing only the batch size.
Training with batch sizes of 32 and 512, the execution time of the fit() method did not change at all.

I think this is also strange.
The bottleneck seems to lie outside the matrix computation.
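That batch-size invariance is what you would expect if a fixed per-call overhead dominates the per-sample compute. A toy cost model illustrates this (all numbers below are illustrative, not measured):

```java
// Toy cost model: fit() time = fixed per-call overhead + per-sample work.
// When the overhead term dominates, batch size barely affects the total.
public class OverheadModel {
    // Hypothetical cost of one fit() call, in milliseconds.
    public static double fitMillis(int batchSize, double overheadMs, double perSampleMs) {
        return overheadMs + perSampleMs * batchSize;
    }

    public static void main(String[] args) {
        // With 2500 ms overhead and 0.05 ms/sample, batch 32 and 512 differ by < 1%.
        System.out.printf("batch 32:  %.0f ms%n", fitMillis(32, 2500, 0.05));
        System.out.printf("batch 512: %.0f ms%n", fitMillis(512, 2500, 0.05));
    }
}
```

Under these assumed numbers, a 16x larger batch changes the per-call time by under one percent, matching the observation that fit() time barely moves with batch size.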


I tried changing the GC settings, with no change:

Nd4j.getMemoryManager().togglePeriodicGc(false)
Nd4j.getMemoryManager().autoGcWindow = 10000

I also tried changing the CacheType of the workspace, but there was no change there either.


In the original post, Windows Task Manager showed 100% GPU utilization while GPU-Z showed about 20% GPU Load, but that discrepancy is probably not the problem: the 1660 Ti showed similar figures.
It seems to be just a difference in measurement method.


I remember another strange phenomenon.

This occurred not during training but during inference with the 3060.

I ran 5000 samples through a trained network.
With the 1660 Ti, the run time per step is constant: progress ticks up visibly at a steady rate of one step at a time until it reaches 5000.

With the 3060, however, the following behavior kept repeating:

  • "One step takes 2 seconds."
  • "Then 10 steps complete in an instant."

In other words, calls to nn.output sometimes took a very long time and sometimes a very short time.
(It is possible that this phenomenon was just coincidence.)


I wanted to measure fine-grained load on the GPU, but NVIDIA's profiling tool did not work well.
I will keep trying.

You can try using something like jvisualvm or, if you’ve got the licence, yourkit. That way you don’t have to guess as much about what is going on and instead collect some actual performance metrics.

You should be able to see what exactly is taking time.

You should also do a sanity check: Take one of the dl4j examples and try to verify if you see the same kind of difference there. However, I’d suggest using an example that is not using MNIST data, as that particular data loader may be causing slowdowns in old dl4j versions.


After years of struggling with this issue, I finally found the root cause — and it was partly my own mistake.

When I upgraded from beta7 to M2.1, I noticed that deeplearning4j-cuda-11.6 didn't exist on Maven. Since the build failed with that dependency, I simply removed it from my pom.xml. Everything seemed to work fine; the startup logs showed:

Loaded [JCublasBackend] backend

Backend used: [CUDA]

HAVE_CUDNN

Seeing HAVE_CUDNN in the output, I assumed cuDNN support had been merged into the main nd4j-cuda-11.6 artifact and that the separate deeplearning4j-cuda module was no longer needed. This was wrong.

What actually happened: HAVE_CUDNN only means the ND4J native backend was compiled with cuDNN support. The actual cuDNN helper classes (like CudnnLSTMHelper) that make LSTM use cuDNN's fused kernel were in deeplearning4j-cuda, which was never released for M2.1. On top of that, M2.1 changed HELPER_DISABLE_DEFAULT_VALUE to "true", disabling all cuDNN helpers by default.

Without the cuDNN LSTM helper, every LSTM layer falls back to a Java implementation that processes each timestep as individual GPU operations, instead of cuDNN's single fused call for all timesteps. This made fit() roughly 40x slower: the cause was not the GPU upgrade, but the DL4J version change.

The fix:

1. Port CudnnLSTMHelper and BaseCudnnHelper from deeplearning4j/deeplearning4j/deeplearning4j-cuda at 1.0.0-M1.1 · deeplearning4j/deeplearning4j · GitHub into your project under org.deeplearning4j.cuda.recurrent. DL4J discovers helpers via reflection by class name, so they just need to be on the classpath.

2. Add a CudnnLSTMHelper(Object[]) constructor: HelperUtils.createHelper() in M2.1 wraps arguments in new Object[]{arguments}, so the original CudnnLSTMHelper(DataType) constructor won't be found.

3. Set System.setProperty("org.eclipse.deeplearning4j.helpers.disable", "false") before network initialization.

4. Place the cuDNN 8.3.x DLLs plus zlibwapi.dll in the JavaCPP cache directory (~/.javacpp/cache/cuda-11.6-8.3-1.5.7-windows-x86_64.jar/org/bytedeco/cuda/windows-x86_64/) and add that directory to your system PATH.
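Step 2 can be sketched as follows. This is a simplified stand-in, not the real helper: the class body and field are illustrative, and the real constructor takes an ND4J DataType rather than Object. The point is only the Object[]-unwrapping constructor that M2.1's HelperUtils.createHelper() needs to find:

```java
// Sketch of the constructor pair the ported helper needs. In M2.1,
// HelperUtils.createHelper() passes constructor args wrapped as new Object[]{args},
// so an Object[] constructor must exist alongside the original one.
public class CudnnLSTMHelperSketch {
    private final Object dataType; // stands in for the real ND4J DataType field

    // Original-style constructor (as in the 1.0.0-M1.1 helper).
    public CudnnLSTMHelperSketch(Object dataType) {
        this.dataType = dataType;
    }

    // Added for M2.1: unwrap the Object[] that HelperUtils passes in.
    public CudnnLSTMHelperSketch(Object[] args) {
        this((args != null && args.length > 0) ? args[0] : null);
    }

    public Object dataType() {
        return dataType;
    }
}
```

With this in place, reflection-based construction with a wrapped argument array resolves to the Object[] overload, which delegates to the original constructor.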

Result: training speed went from ~30 samples/sec to ~850 samples/sec on an RTX 4000 Ada.

—

I want to apologize for taking up the community's time with this issue; the root cause turned out to be my own misunderstanding of the dependency structure. I should have investigated the missing deeplearning4j-cuda module more thoroughly instead of assuming it had been merged. Thank you to everyone who took the time to respond to my original post back in 2022. Your suggestions helped point me in the right direction, even though it took me a few more years to finally pin down the exact cause. I hope this solution helps anyone else who encounters the same issue with DL4J M2.1.