I have a box with 12 CPU cores (24 threads with HT) and two Nvidia GeForce RTX 2080 cards. I am using DL4J 1.0.0-beta6. I have a simple sequential model with 32 nodes in the first hidden layer and 16 nodes in the second hidden layer. When I train it on CPU only, all 24 threads are busy. But when I switch to GPU, I see only two threads running, and training progresses much more slowly than on CPU. I tried all the diagnostic steps here:
but still could not figure out why it is so slow. GPU utilization is only about 6% on one card (the other sits at 0%). Is there anything else I might have missed? My OS is Ubuntu 18.04. Thanks for any pointers.
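
For reference, by "switch to GPU" I mean swapping the ND4J backend dependency in my `pom.xml`. A sketch of the two variants I toggle between (the CUDA version `10.2` is what I have installed; adjust to match yours):

```xml
<!-- CPU backend (used for the fast 24-thread run) -->
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-native-platform</artifactId>
    <version>1.0.0-beta6</version>
</dependency>

<!-- GPU backend (used for the slow run); replace the CPU dependency with this -->
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-cuda-10.2-platform</artifactId>
    <version>1.0.0-beta6</version>
</dependency>
```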