How to customize a dataset iterator that supports multiple GPUs?

Dataset iterators produce dataset objects; they don’t care where you are going to use them down the line.

As far as I’m aware, the ParallelWrapper implementation for training on multiple GPUs has also been dropped.

What are you trying to do?

My code:

        ParallelWrapper mutilGPUWrapper = null;
        if (mutilGPU) {
            mutilGPUWrapper = new ParallelWrapper.Builder(model)
                    .prefetchBuffer(prefetchBufferMutilGPU)
                    .workers(workersMutilGPU)
                    .averagingFrequency(avgFrequencyMutilGPU)
                    .reportScoreAfterAveraging(false)
                    .build();
        }
        
        for (int i = 0; i < nEpochs; i++) {
            trainIterator.reset();
            if (mutilGPU) {
                mutilGPUWrapper.fit(trainIterator);                
            } else {
                model.fit(trainIterator);
            }
            System.out.println("==No." + i + " nEpochs, " //
                    + LocalTime.now() + ", model.score=" + model.score());
        }

Output:

==No.0 nEpochs, 09:54:45.681890090, model.score=0.2083020586814072
==No.1 nEpochs, 09:55:39.290619835, model.score=0.19151817910537472
Exception in thread "main" java.lang.RuntimeException: java.lang.RuntimeException: java.lang.RuntimeException: Op [adam_updater] execution failed
	at org.deeplearning4j.parallelism.ParallelWrapper.fit(ParallelWrapper.java:590)
	at com.cq.aifocusstocks.train.RnnPredictModel.train(RnnPredictModel.java:173)
	at com.cq.aifocusstocks.train.CnnLstmRegPredictor.trainModel(CnnLstmRegPredictor.java:222)
	at com.cq.aifocusstocks.train.TrainCnnLstmModel.main(TrainCnnLstmModel.java:15)
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: Op [adam_updater] execution failed
	at org.deeplearning4j.parallelism.trainer.DefaultTrainer.waitTillRunning(DefaultTrainer.java:468)
	at org.deeplearning4j.parallelism.ParallelWrapper.fit(ParallelWrapper.java:588)
	... 3 more
Caused by: java.lang.RuntimeException: Op [adam_updater] execution failed
	at org.nd4j.linalg.jcublas.ops.executioner.CudaExecutioner.exec(CudaExecutioner.java:1881)
	at org.nd4j.linalg.factory.Nd4j.exec(Nd4j.java:6545)
	at org.nd4j.linalg.learning.AdamUpdater.applyUpdater(AdamUpdater.java:110)
	at org.deeplearning4j.nn.updater.UpdaterBlock.update(UpdaterBlock.java:162)
	at org.deeplearning4j.nn.updater.UpdaterBlock.updateExternalGradient(UpdaterBlock.java:128)
	at org.deeplearning4j.nn.updater.BaseMultiLayerUpdater.update(BaseMultiLayerUpdater.java:320)
	at org.deeplearning4j.nn.updater.BaseMultiLayerUpdater.update(BaseMultiLayerUpdater.java:247)
	at org.deeplearning4j.optimize.solvers.BaseOptimizer.updateGradientAccordingToParams(BaseOptimizer.java:309)
	at org.deeplearning4j.optimize.solvers.BaseOptimizer.gradientAndScore(BaseOptimizer.java:186)
	at org.deeplearning4j.optimize.solvers.StochasticGradientDescent.optimize(StochasticGradientDescent.java:61)
	at org.deeplearning4j.optimize.Solver.optimize(Solver.java:52)
	at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.fitHelper(MultiLayerNetwork.java:2357)
	at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.fit(MultiLayerNetwork.java:2315)
	at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.fit(MultiLayerNetwork.java:2378)
	at org.deeplearning4j.parallelism.trainer.DefaultTrainer.fit(DefaultTrainer.java:233)
	at org.deeplearning4j.parallelism.trainer.DefaultTrainer.run(DefaultTrainer.java:382)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at org.deeplearning4j.parallelism.ParallelWrapper$2$1.run(ParallelWrapper.java:156)
	at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.lang.RuntimeException: adamUpdater: cuda stream synchronization failed !; Error code: [700]
	at org.nd4j.linalg.jcublas.ops.executioner.CudaExecutioner.exec(CudaExecutioner.java:2067)
	at org.nd4j.linalg.jcublas.ops.executioner.CudaExecutioner.exec(CudaExecutioner.java:1870)
	... 19 more
Exception in thread "DeallocatorServiceThread_4" java.lang.RuntimeException: java.lang.RuntimeException: [DEVICE] deallocation failed; Error code: [700]
	at org.nd4j.linalg.api.memory.deallocation.DeallocatorService$DeallocatorServiceThread.run(DeallocatorService.java:151)
Caused by: java.lang.RuntimeException: [DEVICE] deallocation failed; Error code: [700]
	at org.nd4j.linalg.jcublas.bindings.Nd4jCuda.deleteDataBuffer(Native Method)
	at org.nd4j.jita.allocator.impl.CudaDeallocator.deallocate(CudaDeallocator.java:40)
	at org.nd4j.linalg.api.memory.deallocation.DeallocatorService$DeallocatorServiceThread.run(DeallocatorService.java:146)
Exception in thread "DeallocatorServiceThread_7" Exception in thread "DeallocatorServiceThread_1" Exception in thread "DeallocatorServiceThread_5" java.lang.RuntimeException: java.lang.RuntimeException: [DEVICE] deallocation failed; Error code: [700]
	at org.nd4j.linalg.api.memory.deallocation.DeallocatorService$DeallocatorServiceThread.run(DeallocatorService.java:151)
Caused by: java.lang.RuntimeException: [DEVICE] deallocation failed; Error code: [700]
	at org.nd4j.linalg.jcublas.bindings.Nd4jCuda.deleteDataBuffer(Native Method)
	at org.nd4j.jita.allocator.impl.CudaDeallocator.deallocate(CudaDeallocator.java:40)
	at org.nd4j.linalg.api.memory.deallocation.DeallocatorService$DeallocatorServiceThread.run(DeallocatorService.java:146)
Exception in thread "DeallocatorServiceThread_0" Exception in thread "DeallocatorServiceThread_3" Exception in thread "DeallocatorServiceThread_2" java.lang.RuntimeException: java.lang.RuntimeException: [DEVICE] deallocation failed; Error code: [700]
	at org.nd4j.linalg.api.memory.deallocation.DeallocatorService$DeallocatorServiceThread.run(DeallocatorService.java:151)
Caused by: java.lang.RuntimeException: [DEVICE] deallocation failed; Error code: [700]
	at org.nd4j.linalg.jcublas.bindings.Nd4jCuda.deleteDataBuffer(Native Method)
	at org.nd4j.jita.allocator.impl.CudaDeallocator.deallocate(CudaDeallocator.java:40)
	at org.nd4j.linalg.api.memory.deallocation.DeallocatorService$DeallocatorServiceThread.run(DeallocatorService.java:146)
Exception in thread "DeallocatorServiceThread_6" java.lang.RuntimeException: java.lang.RuntimeException: [DEVICE] deallocation failed; Error code: [700]
	at org.nd4j.linalg.api.memory.deallocation.DeallocatorService$DeallocatorServiceThread.run(DeallocatorService.java:151)
Caused by: java.lang.RuntimeException: [DEVICE] deallocation failed; Error code: [700]
	at org.nd4j.linalg.jcublas.bindings.Nd4jCuda.deleteDataBuffer(Native Method)
	at org.nd4j.jita.allocator.impl.CudaDeallocator.deallocate(CudaDeallocator.java:40)
	at org.nd4j.linalg.api.memory.deallocation.DeallocatorService$DeallocatorServiceThread.run(DeallocatorService.java:146)
java.lang.RuntimeException: java.lang.RuntimeException: [DEVICE] deallocation failed; Error code: [700]
	at org.nd4j.linalg.api.memory.deallocation.DeallocatorService$DeallocatorServiceThread.run(DeallocatorService.java:151)
Caused by: java.lang.RuntimeException: [DEVICE] deallocation failed; Error code: [700]
	at org.nd4j.linalg.jcublas.bindings.Nd4jCuda.deleteDataBuffer(Native Method)
	at org.nd4j.jita.allocator.impl.CudaDeallocator.deallocate(CudaDeallocator.java:40)
	at org.nd4j.linalg.api.memory.deallocation.DeallocatorService$DeallocatorServiceThread.run(DeallocatorService.java:146)
java.lang.RuntimeException: java.lang.RuntimeException: [DEVICE] deallocation failed; Error code: [700]
	at org.nd4j.linalg.api.memory.deallocation.DeallocatorService$DeallocatorServiceThread.run(DeallocatorService.java:151)
Caused by: java.lang.RuntimeException: [DEVICE] deallocation failed; Error code: [700]
09:55:40.010 [ParallelWrapper training thread 2] ERROR org.deeplearning4j.parallelism.ParallelWrapper - Uncaught exception: java.lang.RuntimeException: [DEVICE] deallocation failed; Error code: [700]
09:55:40.010 [ParallelWrapper training thread 0] ERROR org.deeplearning4j.parallelism.ParallelWrapper - Uncaught exception: java.lang.RuntimeException: java.lang.RuntimeException: Op [adam_updater] execution failed
	at org.nd4j.linalg.jcublas.bindings.Nd4jCuda.deleteDataBuffer(Native Method)
	at org.nd4j.jita.allocator.impl.CudaDeallocator.deallocate(CudaDeallocator.java:40)
	at org.nd4j.linalg.api.memory.deallocation.DeallocatorService$DeallocatorServiceThread.run(DeallocatorService.java:146)
java.lang.RuntimeException: [DEVICE] deallocation failed; Error code: [700]
	at org.nd4j.linalg.jcublas.bindings.Nd4jCuda.dbClose(Native Method)
	at org.nd4j.nativeblas.OpaqueDataBuffer.closeBuffer(OpaqueDataBuffer.java:219)
	at org.nd4j.linalg.jcublas.buffer.BaseCudaDataBuffer.release(BaseCudaDataBuffer.java:1814)
	at org.nd4j.linalg.api.buffer.BaseDataBuffer.close(BaseDataBuffer.java:1946)
	at org.nd4j.linalg.api.ndarray.BaseNDArray.close(BaseNDArray.java:5654)
	at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.close(MultiLayerNetwork.java:4148)
	at org.deeplearning4j.parallelism.trainer.DefaultTrainer.run(DefaultTrainer.java:452)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at org.deeplearning4j.parallelism.ParallelWrapper$2$1.run(ParallelWrapper.java:156)
	at java.base/java.lang.Thread.run(Thread.java:833)
java.lang.RuntimeException: java.lang.RuntimeException: [DEVICE] deallocation failed; Error code: [700]
	at org.nd4j.linalg.api.memory.deallocation.DeallocatorService$DeallocatorServiceThread.run(DeallocatorService.java:151)
Caused by: java.lang.RuntimeException: [DEVICE] deallocation failed; Error code: [700]
	at org.nd4j.linalg.jcublas.bindings.Nd4jCuda.deleteDataBuffer(Native Method)
	at org.nd4j.jita.allocator.impl.CudaDeallocator.deallocate(CudaDeallocator.java:40)
	at org.nd4j.linalg.api.memory.deallocation.DeallocatorService$DeallocatorServiceThread.run(DeallocatorService.java:146)
java.lang.RuntimeException: java.lang.RuntimeException: Op [adam_updater] execution failed
	at org.deeplearning4j.parallelism.trainer.DefaultTrainer.run(DefaultTrainer.java:446)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at org.deeplearning4j.parallelism.ParallelWrapper$2$1.run(ParallelWrapper.java:156)
	at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.lang.RuntimeException: Op [adam_updater] execution failed
	at org.nd4j.linalg.jcublas.ops.executioner.CudaExecutioner.exec(CudaExecutioner.java:1881)
	at org.nd4j.linalg.factory.Nd4j.exec(Nd4j.java:6545)
	at org.nd4j.linalg.learning.AdamUpdater.applyUpdater(AdamUpdater.java:110)
	at org.deeplearning4j.nn.updater.UpdaterBlock.update(UpdaterBlock.java:162)
	at org.deeplearning4j.nn.updater.UpdaterBlock.updateExternalGradient(UpdaterBlock.java:128)
	at org.deeplearning4j.nn.updater.BaseMultiLayerUpdater.update(BaseMultiLayerUpdater.java:320)
	at org.deeplearning4j.nn.updater.BaseMultiLayerUpdater.update(BaseMultiLayerUpdater.java:247)
	at org.deeplearning4j.optimize.solvers.BaseOptimizer.updateGradientAccordingToParams(BaseOptimizer.java:309)
	at org.deeplearning4j.optimize.solvers.BaseOptimizer.gradientAndScore(BaseOptimizer.java:186)
	at org.deeplearning4j.optimize.solvers.StochasticGradientDescent.optimize(StochasticGradientDescent.java:61)
	at org.deeplearning4j.optimize.Solver.optimize(Solver.java:52)
	at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.fitHelper(MultiLayerNetwork.java:2357)
	at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.fit(MultiLayerNetwork.java:2315)
	at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.fit(MultiLayerNetwork.java:2378)
	at org.deeplearning4j.parallelism.trainer.DefaultTrainer.fit(DefaultTrainer.java:233)
	at org.deeplearning4j.parallelism.trainer.DefaultTrainer.run(DefaultTrainer.java:382)
	... 4 more
Caused by: java.lang.RuntimeException: adamUpdater: cuda stream synchronization failed !; Error code: [700]
	at org.nd4j.linalg.jcublas.ops.executioner.CudaExecutioner.exec(CudaExecutioner.java:2067)
	at org.nd4j.linalg.jcublas.ops.executioner.CudaExecutioner.exec(CudaExecutioner.java:1870)
	... 19 more
java.lang.RuntimeException: java.lang.RuntimeException: [DEVICE] deallocation failed; Error code: [700]
	at org.nd4j.linalg.api.memory.deallocation.DeallocatorService$DeallocatorServiceThread.run(DeallocatorService.java:151)
Caused by: java.lang.RuntimeException: [DEVICE] deallocation failed; Error code: [700]
	at org.nd4j.linalg.jcublas.bindings.Nd4jCuda.deleteDataBuffer(Native Method)
	at org.nd4j.jita.allocator.impl.CudaDeallocator.deallocate(CudaDeallocator.java:40)
	at org.nd4j.linalg.api.memory.deallocation.DeallocatorService$DeallocatorServiceThread.run(DeallocatorService.java:146)
09:55:40.374 [ParallelWrapper training thread 3] ERROR org.deeplearning4j.parallelism.ParallelWrapper - Uncaught exception: java.lang.RuntimeException: [DEVICE] deallocation failed; Error code: [700]
java.lang.RuntimeException: [DEVICE] deallocation failed; Error code: [700]
	at org.nd4j.linalg.jcublas.bindings.Nd4jCuda.dbClose(Native Method)
	at org.nd4j.nativeblas.OpaqueDataBuffer.closeBuffer(OpaqueDataBuffer.java:219)
	at org.nd4j.linalg.jcublas.buffer.BaseCudaDataBuffer.release(BaseCudaDataBuffer.java:1814)
	at org.nd4j.linalg.api.buffer.BaseDataBuffer.close(BaseDataBuffer.java:1946)
	at org.nd4j.linalg.api.ndarray.BaseNDArray.close(BaseNDArray.java:5654)
	at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.close(MultiLayerNetwork.java:4148)
	at org.deeplearning4j.parallelism.trainer.DefaultTrainer.run(DefaultTrainer.java:452)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at org.deeplearning4j.parallelism.ParallelWrapper$2$1.run(ParallelWrapper.java:156)
	at java.base/java.lang.Thread.run(Thread.java:833)

If I put new ParallelWrapper.Builder(model) inside the loop, the exception is no longer thrown. However, the training results differ significantly from those obtained with a single GPU.

        ParallelWrapper mutilGPUWrapper = null;

        for (int i = 0; i < nEpochs; i++) {
            trainIterator.reset();
            if (mutilGPU) {
                mutilGPUWrapper = new ParallelWrapper.Builder(model)
                        .prefetchBuffer(prefetchBufferMutilGPU)
                        .workers(workersMutilGPU)
                        .averagingFrequency(avgFrequencyMutilGPU)
                        .reportScoreAfterAveraging(false)
                        .build();
                mutilGPUWrapper.fit(trainIterator);
            } else {
                model.fit(trainIterator);
            }
            System.out.println("==No." + i + " nEpochs, " //
                    + LocalTime.now() + ", model.score=" + model.score());
        }

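@cqiaoYc that’s due to parameter averaging. If you average more often (a lower averagingFrequency value, meaning fewer iterations between averaging steps), the results will be closer to single-GPU training, but training will be slower.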
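
To make the parameter-averaging point concrete, here is a toy sketch in plain Java. It is not DL4J code and not how ParallelWrapper is implemented; the class name, the one-parameter "model" and the per-worker targets are all made up for illustration. Each worker takes gradient steps toward its own target (standing in for different minibatches), and every averagingFrequency iterations all workers are reset to the mean. The sketch reports how far the workers drift apart just before they are averaged.

```java
// Toy illustration only: hypothetical names, no DL4J classes involved.
class ParameterAveragingSketch {

    /**
     * Simulates workers that each minimize 0.5 * (w - target)^2 with plain
     * gradient descent and are averaged every `averagingFrequency` iterations.
     * Returns the largest gap between any two workers seen right before an
     * averaging step.
     */
    static double maxDrift(double[] targets, double start, double learningRate,
                           int iterations, int averagingFrequency) {
        double[] w = new double[targets.length];
        java.util.Arrays.fill(w, start);
        double maxDrift = 0.0;

        for (int it = 1; it <= iterations; it++) {
            // Each worker takes one independent gradient step.
            for (int k = 0; k < w.length; k++) {
                w[k] -= learningRate * (w[k] - targets[k]);
            }
            if (it % averagingFrequency == 0) {
                double min = Double.POSITIVE_INFINITY;
                double max = Double.NEGATIVE_INFINITY;
                double sum = 0.0;
                for (double v : w) {
                    min = Math.min(min, v);
                    max = Math.max(max, v);
                    sum += v;
                }
                maxDrift = Math.max(maxDrift, max - min);
                // Averaging step: every worker continues from the mean.
                java.util.Arrays.fill(w, sum / w.length);
            }
        }
        return maxDrift;
    }

    public static void main(String[] args) {
        double[] targets = {-1.0, 1.0}; // two workers pulled in opposite directions
        System.out.println("drift, averaging every iteration:    "
                + maxDrift(targets, 0.0, 0.1, 20, 1));
        System.out.println("drift, averaging every 5 iterations: "
                + maxDrift(targets, 0.0, 0.1, 20, 5));
    }
}
```

The workers diverge more the longer they run without being averaged, which is the trade-off described above: averaging more often keeps the replicas closer to a single model but adds synchronization cost. Even when averaging every iteration, each replica still sees a different minibatch per step, so the result is not expected to match single-GPU training exactly.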

@agibsonccc I set avgFrequencyMutilGPU=1, but the training results still differ significantly from those using a single GPU. Is there a problem with putting new ParallelWrapper.Builder(model) inside the loop?

@cqiaoYc that’s not really going to help. Can you clarify what you’ve tried with your averaging frequency?

@agibsonccc The differences in the in-sample results are relatively large, but this may not be a problem. The out-of-sample test results are very close. I think multi-GPU training might have a similar effect to dropout.

@cqiaoYc yes, it can. Just tune the averaging frequency as best you can to balance training speed/throughput against model performance.