I have a box with 12 CPU cores (24 threads with hyper-threading) and two Nvidia GeForce RTX 2080 cards, running DL4J beta6. I have a simple sequential model with 32 nodes in the first hidden layer and 16 nodes in the second hidden layer. When I train it on CPU only, all 24 threads are busy. But when I switch to the GPU backend, I only see two threads running and progress is much slower than CPU training. I went through all the diagnosis steps here:
https://deeplearning4j.konduit.ai/config/backends/performance-issues
but still could not figure out why it is so slow. GPU utilization is only about 6% on one card (the other sits at 0%). Is there anything else I might have missed? My OS is Ubuntu 18.04. Thanks for any pointers.
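For context, the model is shaped roughly like this in DL4J (the input and output sizes below are placeholders, not my real data, and the updater is just an example):

```java
import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.deeplearning4j.nn.conf.layers.DenseLayer;
import org.deeplearning4j.nn.conf.layers.OutputLayer;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.nd4j.linalg.activations.Activation;
import org.nd4j.linalg.learning.config.Adam;
import org.nd4j.linalg.lossfunctions.LossFunctions;

public class SmallNetSketch {
    public static void main(String[] args) {
        int numInputs = 10;   // placeholder: actual feature count not shown here
        int numOutputs = 2;   // placeholder: actual label count not shown here

        MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
            .updater(new Adam())
            .list()
            // 32-node first hidden layer
            .layer(0, new DenseLayer.Builder().nIn(numInputs).nOut(32)
                .activation(Activation.RELU).build())
            // 16-node second hidden layer
            .layer(1, new DenseLayer.Builder().nIn(32).nOut(16)
                .activation(Activation.RELU).build())
            .layer(2, new OutputLayer.Builder(LossFunctions.LossFunction.MCXENT)
                .nIn(16).nOut(numOutputs)
                .activation(Activation.SOFTMAX).build())
            .build();

        MultiLayerNetwork net = new MultiLayerNetwork(conf);
        net.init();
        System.out.println(net.summary());
    }
}
```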
Your model is very small, so almost all of the time is spent shuffling data between the CPU and the GPU rather than on the actual computation.
For a model that small, training on the CPU is usually more efficient.
Also, multiple GPUs will only be used if you run training through a ParallelWrapper,
e.g.: https://github.com/KonduitAI/deeplearning4j-examples/blob/master/dl4j-cuda-specific-examples/src/main/java/org/deeplearning4j/examples/multigpu/MultiGpuLenetMnistExample.java#L107-L115
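For reference, wrapping an existing model in a ParallelWrapper looks roughly like this (the prefetch/worker/averaging values here are illustrative, not tuned for your setup):

```java
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.deeplearning4j.parallelism.ParallelWrapper;
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;

public class MultiGpuSketch {
    public static void train(MultiLayerNetwork model, DataSetIterator trainData) {
        ParallelWrapper wrapper = new ParallelWrapper.Builder(model)
            .prefetchBuffer(24)               // DataSets pre-loaded per worker (illustrative value)
            .workers(2)                       // one worker per GPU on a dual-GPU box
            .averagingFrequency(3)            // average parameters every 3 minibatches (illustrative)
            .reportScoreAfterAveraging(true)
            .build();

        // The wrapper drives training instead of model.fit(...)
        wrapper.fit(trainData);
    }
}
```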
In any case, ND4J will only use one Backend at a time, so you can’t train on both CPU and GPU at the same time within the same process.
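If you want to confirm which backend is actually active in a given process, printing it at startup is a quick sanity check (a minimal sketch):

```java
import org.nd4j.linalg.factory.Nd4j;

public class BackendCheck {
    public static void main(String[] args) {
        // Prints the single ND4J backend loaded for this process,
        // which depends on whether nd4j-native or nd4j-cuda is on the classpath.
        System.out.println(Nd4j.getBackend().getClass().getName());
    }
}
```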
Paul,
Thanks for the insights. That makes sense. I also found out that my application-level driver code was not invoking the GPU training properly: I was sending too many concurrent training requests to the GPU, and only the first two requests survived; the rest ran out of GPU memory. So only two threads ended up running at the top level. Combined with the CPU-GPU data-transfer overhead you mentioned, that made training on the GPU slower. I’ll stick with CPU training for this small model. Thank you.
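In case it helps anyone else hitting the same problem, one way to avoid flooding the GPU with concurrent fit() calls is to bound how many training jobs run at once. This is only a hypothetical sketch of such a throttle (the method and its arguments are made up for illustration, not my actual driver code):

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;

public class ThrottledTrainer {
    // Hypothetical driver-side throttle: at most two fit() calls run at once,
    // so later requests don't exhaust GPU memory.
    public static void trainAll(List<MultiLayerNetwork> models,
                                List<DataSetIterator> data) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        for (int i = 0; i < models.size(); i++) {
            final MultiLayerNetwork net = models.get(i);
            final DataSetIterator iter = data.get(i);
            pool.submit(() -> net.fit(iter));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}
```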