I had been using a 1660 Ti with CUDA 10.2 and DL4J 1.0.0beta7.
We have now switched to a 3060 with CUDA 11.6 and DL4J 1.0.0-M2.1.
The CUDA version was raised because the 3060 is not compatible with CUDA 10.2, and the DL4J version was raised because 1.0.0beta7 does not support the CUDA 11 series.
In other words, this is the only configuration that can be tested in the current environment.
The driver is up to date.
CUDA and cuDNN are pulled in via Maven.
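For reference, the backend is declared roughly like this in the pom (the exact coordinates and the cuDNN classifier are my assumption for 1.0.0-M2.1; adjust to the actual pom):

```xml
<!-- Assumed ND4J CUDA 11.6 backend for 1.0.0-M2.1 -->
<dependency>
  <groupId>org.nd4j</groupId>
  <artifactId>nd4j-cuda-11.6</artifactId>
  <version>1.0.0-M2.1</version>
</dependency>
<!-- cuDNN support via platform classifier (assumption; platform-specific) -->
<dependency>
  <groupId>org.nd4j</groupId>
  <artifactId>nd4j-cuda-11.6</artifactId>
  <version>1.0.0-M2.1</version>
  <classifier>windows-x86_64-cudnn</classifier>
</dependency>
```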
Even though the GPU is much more powerful, training speed has dropped to about a third of what it was before.
Windows Task Manager shows 100% GPU usage, but GPU-Z reports a GPU load of only about 20%.
I measured the execution time of the network.fit() method: it takes 2.5-4 seconds even for a small mini-batch on a very simple network.
So the slowdown clearly comes from the slow execution of fit() itself, but the root cause is unknown, because even a very simple test is this slow.
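The measurement was a plain wall-clock wrapper around the call. A minimal, self-contained sketch of the approach (the timed body below is a stand-in workload, not the actual network.fit() call):

```java
public class FitTimer {
    // Runs the given work once and returns elapsed wall-clock time in milliseconds.
    static long timeMillis(Runnable work) {
        long start = System.nanoTime();
        work.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        // Stand-in workload; in the real test this line is network.fit(trainIterator)
        long elapsed = timeMillis(() -> {
            double acc = 0;
            for (int i = 0; i < 1_000_000; i++) acc += Math.sqrt(i);
            if (acc < 0) System.out.println(acc); // prevents the loop from being optimized away
        });
        System.out.println("workload took " + elapsed + " ms");
    }
}
```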
CUDA is recognized by DL4J.
I also tried calling Nd4j.create() many times to see whether GPU memory usage increases.
What could be the cause?
Given how simple the test network is, 2.5-4 seconds per fit() call is inconceivable; the problem must lie elsewhere.
It is also important to note that the fit() call takes the same amount of time for both a mini-batch of 32 and a mini-batch of 256.
A mini-batch of 256 is naturally more computationally intensive, yet for some reason it takes almost the same time, which suggests the bottleneck is a fixed per-call overhead rather than the computation itself.