Significant Performance Drop in `nn.fit()` method After Upgrading to NVIDIA 3060 and CUDA 11.6

I had been using a 1660 Ti with CUDA 10.2 and DL4J 1.0.0-beta7.
We have now switched to a 3060, CUDA 11.6, and DL4J 1.0.0-M2.1.

Notes:
The CUDA version was raised because the new GPU is not compatible with CUDA 10.2.
The DL4J version was raised because 1.0.0-beta7 does not support the CUDA 11 series.
In other words, this is the only configuration that can be tested in the current environment.
The driver is up to date.
CUDA and cuDNN are pulled in via Maven.

Even though the GPU is now much more powerful, training speed has dropped to about 1/3 of what it was before.

GPU usage shows 100% in Windows Task Manager, but GPU-Z reports a GPU Load of only about 20%.

I measured the execution time of the network.fit() method, and it takes a surprisingly long 2.5-4 seconds even for a small mini-batch on a very simple network.

The slowdown comes entirely from the slow execution of the fit method, but the underlying cause is unknown: even a very simple test is still slow to execute.

CUDA is recognized by DL4J; to double-check, I called Nd4j.create() many times and watched whether GPU memory usage increased.
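
A minimal sketch of that kind of check (printing the backend/executioner class names is just an extra sanity check that the CUDA backend, not the CPU backend, was loaded):

import org.nd4j.linalg.factory.Nd4j

fun main() {
    // Print which backend ND4J actually loaded; it should be a CUDA backend class.
    println("Backend: " + Nd4j.getBackend().javaClass.name)
    println("Executioner: " + Nd4j.getExecutioner().javaClass.name)

    // Allocate a few arrays and watch dedicated GPU memory grow in GPU-Z / nvidia-smi.
    repeat(10) { i ->
        val arr = Nd4j.create(1024, 1024)
        println("allocated #$i, shape=${arr.shape().contentToString()}")
    }
}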

What could be the cause?

Since I tried it on such a simple network, it is hard to believe it should take 2.5-4 seconds.
There must be a problem elsewhere.

On this very simple network, it is also worth noting that the fit call takes the same amount of time for mini-batch sizes of 32 and 256.
A mini-batch of 256 is naturally more computationally intensive, yet for some reason the fit call takes almost the same amount of time.

When I trained the same network on the 1660 instead of the 3060, the 1660 was 10 times faster:
the fit call takes about 500 milliseconds on the 1660, versus about 5 seconds on the 3060.

@Nasyumaro let me take a look at this in the next version. A new release is pending. I'm spending a lot of time on CUDA QA right now. In the meantime, could you file a reproducer on GitHub? Thanks!


No need to read

<dependency>
        <groupId>org.bytedeco</groupId>
        <artifactId>cuda-platform-redist</artifactId>
        <version>11.4-8.2-1.5.6</version>
</dependency>

I tried using the dependency above because of the concerns below, but the problem was not resolved.

The logs confirm that the CUDA version and related libraries were switched.

Concerns that were not actually a problem (No need to read)

Here, CUDA 11.6 and cuDNN 8.3 are defined as a combination.

However, according to this site, no cuDNN 8.3 build corresponding to CUDA 11.6 exists.

@Nasyumaro it does here: Maven Central Repository Search. This is literally what was released. Looking at our build file, this is the cuDNN we use:
https://developer.download.nvidia.com/compute/redist/cudnn/v8.3.2/local_installers/11.5/cudnn-linux-x86_64-8.3.2.44_cuda11.5-archive.tar.xz

These are the builds it was released with; see the build script.


I would like to provide code, but the program is so large that I cannot extract a fragment that can be tested on its own… Let me think about it a little more…

With the exact same program and network, this phenomenon occurs just by changing the GPU and CUDA version.
(Training speed drops to about 1/10 even though the GPU was upgraded.)

For now, the test network is as follows:

  • The input has about 60 features in total.
  • The input has 120 time steps.
  • The output size is 2.
val conf: ComputationGraphConfiguration = NeuralNetConfiguration.Builder()
    .seed(LearningMain.rand.nextLong())
    .weightInit(WeightInit.XAVIER)
    .miniBatch(true)
    .updater(Adam())
    .optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)
    .graphBuilder()
    .addInputs(inputFunc.map { it.name })
    .addLayer(
        "${fc}_LSTM1", LSTM.Builder()
            .nIn(inputFunc3D.sumBy { it.numColumn }).nOut(128)
            .activation(Activation.TANH)
            .build(),
        *inputFunc3D.map { it.name }.toTypedArray()
    )
    .addLayer(
        "${fc}_LSTM2", LSTM.Builder()
            .nIn(128).nOut(128)
            .activation(Activation.TANH)
            .build(),
        "${fc}_LSTM1"
    )
    .addLayer(
        "LTS", LastTimeStep(LSTM.Builder()
            .activation(Activation.TANH)
            .nIn(128).nOut(128)
            .build()),
        "${fc}_LSTM2"
    )
    .addLayer(
        "FC", DenseLayer.Builder()
            .activation(Activation.TANH)
            .nIn(128 + 1).nOut(128)
            .build(),
        "LTS", "centerWidth2D"
    )
    .addLayer(
        "MCXENT", OutputLayer.Builder(LossFunctions.LossFunction.MCXENT)
            .activation(Activation.SOFTMAX)
            .nIn(128).nOut(outputNames[0].size)
            .build(),
        "FC"
    )
    .setOutputs("MCXENT")
    .build()

val nn = ComputationGraph(conf)
nn.init()
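
As a side note, a PerformanceListener can be attached to get per-iteration statistics (samples/sec, batches/sec, ETL time); the reporting frequency of 1 here is just an arbitrary example value:

import org.deeplearning4j.optimize.listeners.PerformanceListener

// Report stats every iteration, including the score.
nn.setListeners(PerformanceListener(1, true))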

If I make the network roughly twice as deep, the fit method takes roughly twice as long to run.

The number of input and output parameters has not changed; only the middle layers have grown, so the amount of data that has to be transferred to the GPU each epoch has not increased.

Of course, as the network becomes more complex, calling fit becomes slower.
However, the network is so small that I don't think the matrix computations themselves should take this long.

From this, I suspect the cause is something that scales with the number of layers, such as per-layer CPU<->GPU overhead incurred within each epoch.
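
One way to test this per-op overhead theory would be to turn on ND4J's op-level logging for a single small fit call (a sketch only; trainBatch is a placeholder for one small mini-batch from my pipeline):

// Very verbose: logs every op as it is executed, so enable it only briefly.
Nd4j.getExecutioner().enableDebugMode(true)
Nd4j.getExecutioner().enableVerboseMode(true)
nn.fit(trainBatch)   // trainBatch: placeholder for a single small MultiDataSet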


I measured the execution time again with the same network, changing only the mini-batch size.
Training with batch sizes of 32 and 512, the execution time of the fit method did not change at all.

I think this is also strange; there seems to be a problem somewhere outside the matrix calculations.
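
The measurement was roughly like this sketch (buildBatch is a placeholder for my actual data pipeline):

for (batchSize in intArrayOf(32, 512)) {
    val ds = buildBatch(batchSize)   // hypothetical helper returning a MultiDataSet
    val t0 = System.nanoTime()
    nn.fit(ds)
    println("batchSize=$batchSize -> fit() took ${(System.nanoTime() - t0) / 1_000_000} ms")
}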


I tried changing the GC settings and there was no change.

Nd4j.getMemoryManager().togglePeriodicGc(false)
Nd4j.getMemoryManager().autoGcWindow = 10000
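// Note: togglePeriodicGc(false) disables the periodic GC thread entirely,
// so the autoGcWindow value only has an effect when periodic GC is left enabled.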

I also tried changing the cache mode of the workspace, but there was no change there either.
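
These settings are configured on the builder, roughly like this (the values shown are just examples):

NeuralNetConfiguration.Builder()
    .cacheMode(CacheMode.DEVICE)                    // e.g. NONE / HOST / DEVICE
    .trainingWorkspaceMode(WorkspaceMode.ENABLED)   // e.g. ENABLED / NONE
    .inferenceWorkspaceMode(WorkspaceMode.ENABLED)
    // ... rest of the configuration as above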


As mentioned above, Windows Task Manager shows 100% GPU utilization while GPU-Z shows about 25% GPU Load, but that is probably not a problem in itself: the 1660 Ti showed similar figures.
It seems to be just a difference in how the two tools measure utilization.


I remembered another strange phenomenon.

This was not during training, but during inference with the 3060.

I ran inference on 5,000 samples with a trained network.
With the 1660 Ti, the runtime per sample is constant, and the progress counter ticks up steadily, one step at a time, until it reaches 5,000.

With the 3060, however,

  • “one step takes 2 seconds”
  • “then 10 steps complete in an instant”

This pattern repeated over and over.
In other words, calls to nn.outputs sometimes took a very long time and sometimes a very short time.
(It is possible that this was just a coincidence.)
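
A sketch of how the per-call time could be logged to see whether the slow calls are random or periodic (inputs(i) is a hypothetical helper returning the i-th sample's input arrays):

for (i in 0 until 5000) {
    val t0 = System.nanoTime()
    val out = nn.output(*inputs(i))   // ComputationGraph.output(vararg INDArray)
    val ms = (System.nanoTime() - t0) / 1_000_000
    if (ms > 100) println("sample $i took $ms ms")
}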


I wanted to measure the GPU load in more detail, but NVIDIA's profiling tool did not work well.
I will keep trying.

You can try using something like jvisualvm or, if you've got the licence, YourKit. That way you don't have to guess as much about what is going on and can instead collect some actual performance metrics.

You should be able to see what exactly is taking time.

You should also do a sanity check: Take one of the dl4j examples and try to verify if you see the same kind of difference there. However, I’d suggest using an example that is not using MNIST data, as that particular data loader may be causing slowdowns in old dl4j versions.
