Your model is very small, so almost all of the time is spent shuffling data between the GPU and the CPU.
For a model that small, using the CPU is usually more efficient.
Also, multiple GPUs will only be used if you run training through a ParallelWrapper,
e.g.: https://github.com/KonduitAI/deeplearning4j-examples/blob/master/dl4j-cuda-specific-examples/src/main/java/org/deeplearning4j/examples/multigpu/MultiGpuLenetMnistExample.java#L107-L115
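A condensed sketch of what that looks like (assuming `model` is an already-initialized `MultiLayerNetwork` and `trainIter` is your `DataSetIterator`; the names and values here are illustrative, roughly mirroring the linked example):

```java
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.deeplearning4j.parallelism.ParallelWrapper;
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;

public class MultiGpuTraining {
    // Wraps an already-initialized network so that fit() fans batches
    // out across multiple devices (requires the CUDA backend for GPUs).
    public static void trainMultiGpu(MultiLayerNetwork model, DataSetIterator trainIter) {
        ParallelWrapper wrapper = new ParallelWrapper.Builder(model)
            .prefetchBuffer(24)              // DataSets to prefetch per worker
            .workers(2)                      // set >= number of available devices
            .averagingFrequency(3)           // average parameters every 3 iterations
            .reportScoreAfterAveraging(true) // log the score after each averaging
            .build();

        wrapper.fit(trainIter);              // trains all workers in parallel
    }
}
```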
In any case, ND4J will only use one backend at a time, so you can’t train on both CPU and GPU at the same time within the same process.
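If you want to verify which backend a given process actually loaded, a quick sanity check (assuming a standard ND4J setup; the class name is just for illustration) is:

```java
import org.nd4j.linalg.factory.Nd4j;

public class BackendCheck {
    public static void main(String[] args) {
        // Prints the single backend this JVM loaded,
        // e.g. "CpuBackend" or "JCublasBackend" (CUDA)
        System.out.println(Nd4j.getBackend().getClass().getSimpleName());
    }
}
```

Which backend gets loaded is determined by the ND4J dependency on your classpath (nd4j-native for CPU, nd4j-cuda-* for GPU).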