Only two threads are running when switching training to GPU

Your model is very small, so almost all of the time is spent shuffling data between the CPU and the GPU rather than computing.

When you have a model that small, using the CPU is usually more efficient.

Also, multiple GPUs will only be used if you train through a parallel wrapper such as ParallelWrapper,
e.g.: https://github.com/KonduitAI/deeplearning4j-examples/blob/master/dl4j-cuda-specific-examples/src/main/java/org/deeplearning4j/examples/multigpu/MultiGpuLenetMnistExample.java#L107-L115
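For reference, this is roughly the shape of what the linked example does. It's a minimal sketch, assuming you already have a `MultiLayerNetwork` and a `DataSetIterator` built the usual way; the builder values (workers, prefetch buffer, averaging frequency) are illustrative, not tuned recommendations:

```java
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.deeplearning4j.parallelism.ParallelWrapper;
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;

public class MultiGpuTrainingSketch {

    // Wrap an existing model so training is replicated across several GPUs.
    // model and trainData are placeholders for whatever you normally build.
    static void trainOnMultipleGpus(MultiLayerNetwork model, DataSetIterator trainData) {
        ParallelWrapper wrapper = new ParallelWrapper.Builder<>(model)
                .prefetchBuffer(24)               // DataSets pre-fetched per worker
                .workers(2)                       // typically one worker per GPU
                .averagingFrequency(3)            // average parameters every 3 minibatches
                .reportScoreAfterAveraging(true)
                .build();

        // fit() distributes minibatches across the workers, i.e. across the GPUs
        wrapper.fit(trainData);
    }
}
```

Without that wrapper, a plain `model.fit(trainData)` will only ever use a single device.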

In any case, ND4J will only use one Backend at a time, so you can’t train on both CPU and GPU at the same time within the same process.
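If you want to confirm which backend your process actually loaded, something like the following should do it (assuming `Nd4j.getBackend()` is available on your ND4J version):

```java
import org.nd4j.linalg.factory.Nd4j;

public class BackendCheck {
    public static void main(String[] args) {
        // Prints the single backend ND4J loaded for this process,
        // e.g. a CPU (native) backend class or a CUDA backend class.
        System.out.println("Active ND4J backend: " + Nd4j.getBackend().getClass().getName());
    }
}
```

The backend is chosen once at startup from the classpath, so switching between CPU and GPU means changing the dependency, not anything at runtime.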