How to increase Batch size

chaoyongyue · September 16, 2020, 9:06am

When I train VGG on multiple GPUs and increase the batch size, the program fails with an out-of-memory error. However, the same amount of memory is sufficient in TensorFlow. The pictures below show the errors.

agibsonccc · September 16, 2020, 9:20am

“This happens on TensorFlow” and “this happens in DL4J” doesn’t really help us in any way. We don’t know what you were using or what the context was.
Could you give us something we can run to compare for ourselves?

chaoyongyue · September 16, 2020, 9:21am

agibsonccc · September 16, 2020, 9:22am

Right, I know what your error is. I want the code you were using in TensorFlow so I can compare. Give me two side-by-side runnable scripts that you actually ran yourself and I can tell you more.

chaoyongyue · September 16, 2020, 9:28am

Thanks for your reply. This is part of the code. I think the memory allocation method may be different during training, but I don’t know how DL4J allocates memory during multi-GPU training.

agibsonccc · September 16, 2020, 10:20am

Ah, ParallelWrapper keeps multiple copies of the model in memory. That’s why. ParallelWrapper is mainly for multi-GPU training. Go ahead and just use plain DL4J without the wrapper and you should be fine.

chaoyongyue · September 16, 2020, 10:57am

Well, I am sorry, but I can’t understand where specifically I need to modify the code. I use two GPUs to train the VGG network. So if I can’t use ParallelWrapper, how can I train VGG on multiple GPUs with a large batch size?

agibsonccc · September 16, 2020, 1:09pm

Oh, I see, that was on purpose. Sorry, I missed that in your post.
Make sure your number of workers equals your number of GPUs, and reduce your prefetch buffer. That might be what’s taking up your memory.
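
To make the two suggestions above concrete, here is a rough back-of-envelope sketch in plain Java. It is not DL4J code and does not reflect how DL4J actually accounts for memory. The class `MemoryEstimate` and its methods are hypothetical helpers, and the numbers (224x224 RGB float32 input, batch size 64, roughly 138 million parameters as in VGG16) are assumptions, since the thread does not say which VGG variant or input size was used. It only shows how quickly prefetched batches and per-worker model copies add up.

```java
// Back-of-envelope estimate only; every size used here is an illustrative assumption.
final class MemoryEstimate {

    static final long BYTES_PER_FLOAT = 4L;

    /** Bytes for the features of one float32 image batch (labels ignored). */
    static long batchBytes(int batchSize, int channels, int height, int width) {
        return (long) batchSize * channels * height * width * BYTES_PER_FLOAT;
    }

    /** Bytes held by a prefetch queue that buffers `prefetch` batches. */
    static long prefetchBytes(int prefetch, long bytesPerBatch) {
        return prefetch * bytesPerBatch;
    }

    /** Bytes for the float32 parameters of `copies` model replicas
     *  (gradients, updater state and activations are NOT included). */
    static long replicaParamBytes(int copies, long paramCount) {
        return copies * paramCount * BYTES_PER_FLOAT;
    }

    public static void main(String[] args) {
        long batch = batchBytes(64, 3, 224, 224);          // assumed input shape
        long params = 138_000_000L;                         // approx. VGG16 size
        for (int prefetch : new int[] {2, 8, 24}) {
            System.out.printf("prefetch=%d -> %d MB of buffered batches%n",
                    prefetch, prefetchBytes(prefetch, batch) / (1024 * 1024));
        }
        System.out.printf("2 replicas -> %d MB of parameters alone%n",
                replicaParamBytes(2, params) / (1024 * 1024));
    }
}
```

In DL4J the corresponding knobs are the `workers(...)` and `prefetchBuffer(...)` settings on `ParallelWrapper.Builder`. Whether buffered batches end up in host or device memory depends on the setup, so treat the figures as an upper-level intuition rather than a measurement.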

chaoyongyue · September 16, 2020, 2:26pm

I tried it, but it failed again. So ParallelWrapper’s way of allocating memory may be different from the others. Is there anything else that might be taking up the memory?
