GPU Performance is worse than CPU

I am training two auto-encoders: one on the CPU and the other on the GPU, with the same training data.

See the two logs below (iteration time intervals).

OS: Windows 11
DL4J Version: M2.1

Please help!

CPU LOG:

GPU LOG:

This usually happens when the GPU is used very inefficiently, for example when you have a very small batch size or a very small model. In those cases the transfer latency between the host system and the GPU dominates the overall time taken.
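If you want to see where that time actually goes, you can attach a PerformanceListener to the network; it logs ETL time, iteration time and samples per second. Just a sketch (the frequency of 10 is arbitrary, and "config" stands for your own network configuration):

import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.deeplearning4j.optimize.listeners.PerformanceListener;

MultiLayerNetwork net = new MultiLayerNetwork(config);
net.init();
// Log timing every 10 iterations; the second argument also reports the score.
// If ETL time is a large fraction of the iteration time, the GPU is starved by data transfer.
net.setListeners(new PerformanceListener(10, true));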

Can you share some more information about how big your model is?

Thanks for your response.

Below is more information. Do you need more?


DataSetIterator cachedDsIter = new ExistingMiniBatchDataSetIterator(cacheDir, "dataset-%d.bin");
// dataset-%d.bin: 168 KB each
// totalMiniBatchCount: 18074
// miniBatchSize: 64
AsyncDataSetIterator asyncDsIter = new AsyncDataSetIterator(cachedDsIter, 64);

.
.
.

double learningRatePositive = 0.001D;
int featureCount = 64;

MultiLayerConfiguration config = new NeuralNetConfiguration.Builder()
    .trainingWorkspaceMode(WorkspaceMode.ENABLED)
    .inferenceWorkspaceMode(WorkspaceMode.ENABLED)
    .cacheMode(CacheMode.DEVICE)
    .seed(seed)
    .optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)
    .weightInit(WeightInit.XAVIER)
    .activation(Activation.TANH)
    .updater(Nadam.builder().learningRate(learningRatePositive).build())
    .l2(0.0005)
    .list()
    .layer(0, new LSTM.Builder().activation(Activation.TANH).nIn(featureCount).nOut(featureCount / 4).build())
    .layer(1, new LSTM.Builder().activation(Activation.TANH).nIn(featureCount / 4).nOut(featureCount / 4 / 4).build())
    .layer(2, new LSTM.Builder().activation(Activation.TANH).nIn(featureCount / 4 / 4).nOut(featureCount / 4 / 4 / 4).build())
    .layer(3, new LSTM.Builder().activation(Activation.TANH).nIn(featureCount / 4 / 4 / 4).nOut(featureCount / 4 / 4).build())
    .layer(4, new LSTM.Builder().activation(Activation.TANH).nIn(featureCount / 4 / 4).nOut(featureCount / 4).build())
    .layer(5, new RnnOutputLayer.Builder().activation(Activation.IDENTITY).nIn(featureCount / 4).nOut(featureCount).lossFunction(LossFunction.MSE).build())
    .build();

LSTMs tend to amplify the problem I talked about in my last post.

You've got a mini-batch size of 64 and 64 features with just a few layers that have to work mostly sequentially.

That means your GPU is doing essentially no work in between waiting for data to move.

You could try a larger batch size; given the comments in the code you've posted here, you can probably fit your entire dataset into GPU memory at once.
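Since your cached files are fixed at 64 examples each, one way to get a larger effective batch without re-saving them is to merge several cached mini-batches before fitting. Just a sketch (untested; it assumes the sequences in your cached files all have the same length, and "batchesToMerge" is an arbitrary value):

import java.util.ArrayList;
import java.util.List;
import org.nd4j.linalg.dataset.DataSet;

int batchesToMerge = 16; // 16 * 64 = effective batch size of 1024
while (cachedDsIter.hasNext()) {
    List<DataSet> buffer = new ArrayList<>();
    for (int i = 0; i < batchesToMerge && cachedDsIter.hasNext(); i++) {
        buffer.add(cachedDsIter.next());
    }
    // Combine the buffered mini-batches along the example dimension
    DataSet largeBatch = DataSet.merge(buffer);
    net.fit(largeBatch); // net is the MultiLayerNetwork built from your config
}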

OK, I will try.
Thanks.