Running out of GPU Memory Despite Setting Parameters

I’m encountering the following error over and over again:

Exception in thread "main" java.lang.OutOfMemoryError: Cannot allocate new FloatPointer(2048000): totalBytes = 109M, physicalBytes = 34827M
at org.bytedeco.javacpp.FloatPointer.&lt;init&gt;(
at org.bytedeco.javacpp.FloatPointer.&lt;init&gt;(
at org.nd4j.linalg.jcublas.buffer.BaseCudaDataBuffer.set(
at org.nd4j.linalg.jcublas.buffer.BaseCudaDataBuffer.setData(
at org.nd4j.linalg.factory.Nd4j.createTypedBuffer(
at org.nd4j.linalg.jcublas.JCublasNDArrayFactory.create(
at org.nd4j.linalg.factory.Nd4j.create(
at org.nd4j.linalg.factory.Nd4j.create(
Caused by: java.lang.OutOfMemoryError: Physical memory usage is too high: physicalBytes (34827M) > maxPhysicalBytes (34816M)
at org.bytedeco.javacpp.Pointer.deallocator(
at org.bytedeco.javacpp.Pointer.init(
at org.bytedeco.javacpp.FloatPointer.allocateArray(Native Method)
at org.bytedeco.javacpp.FloatPointer.&lt;init&gt;(
… 13 more

java -Xms1G -Xmx2G -Dorg.bytedeco.javacpp.cachedir=$TMPDIR -Dorg.bytedeco.javacpp.maxbytes=30G -Dorg.bytedeco.javacpp.maxphysicalbytes=34G

The GPU I’m using has 32 GB of memory, and the machine has 137 GB of RAM.
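For reference, the heap flags and the JavaCPP flags govern different pools: -Xmx caps only the on-heap JVM memory, while the org.bytedeco.javacpp.maxbytes / maxphysicalbytes properties bound the separate off-heap pool where ND4J allocates its buffers. A minimal plain-Java check (no ND4J dependency) of what the process actually received:

```java
public class MemCheck {
    public static void main(String[] args) {
        // -Xmx limits only the on-heap JVM memory; ND4J buffers live off-heap,
        // governed by the JavaCPP properties passed with -D on the command line.
        long maxHeapMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("Max JVM heap: " + maxHeapMb + " MB");
        System.out.println("javacpp maxbytes: "
                + System.getProperty("org.bytedeco.javacpp.maxbytes", "(not set)"));
        System.out.println("javacpp maxphysicalbytes: "
                + System.getProperty("org.bytedeco.javacpp.maxphysicalbytes", "(not set)"));
    }
}
```

Note that maxPhysicalBytes in the error above counts the whole process's physical memory use, so it has to be large enough to cover heap plus off-heap together.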

All I’m doing is calling the output method of three separate ComputationGraph objects over and over with different inputs. The output arrays are overwritten with each new output; I’m not collecting them. One thing I’m noticing that is unexpected is that GPU memory keeps increasing even though no new arrays are allocated from one iteration to the next. The input might be slightly bigger on some iterations, but it does not grow consistently, while the memory does.

Any help with this issue would be greatly appreciated.

Thank you!


@treo Any thoughts on this? I’m still seeing this behavior and can’t find a solution. Thanks!

@joel-a have you tried calling System.gc() occasionally? Sometimes when people hit an OOM it’s just a race condition: the garbage collector doesn’t collect in time.

@saudet might have more here for you.

@agibsonccc @saudet Thank you for the advice; I will try it.

That said, I don’t understand why the memory is growing in the first place, to 40982M in my last test, and exceeding the maxPhysicalBytes value (set to 40960M in the same run). Shouldn’t the underlying library realize it is reaching its limit and garbage collect before trying to allocate more memory?

@joel-a that’s not quite how it works. The JVM GC isn’t actually aware of the memory on the GPU. The way our off-heap memory management works is that you either trigger the GC yourself, or you set a GC frequency. You can see a similar thread here on this very topic: Dl4j cuda 11.2 running out of memory on evaluation on ubuntu 20.04 - #12 by ajmakoni
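For reference, the two options look roughly like this. This is a configuration sketch (it needs the nd4j dependency to compile); setAutoGcWindow and invokeGc are on ND4J’s MemoryManager:

```java
import org.nd4j.linalg.factory.Nd4j;

public class GcConfig {
    public static void main(String[] args) {
        // Option 1: let ND4J request a GC periodically, at most once per window (ms).
        Nd4j.getMemoryManager().setAutoGcWindow(10000);

        // Option 2: trigger it yourself at a safe point in the inference loop,
        // e.g. after the outputs of an iteration are no longer referenced.
        Nd4j.getMemoryManager().invokeGc();
    }
}
```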

@agibsonccc I followed your advice and now I’m calling System.gc() after every iteration of the loop that runs inference on the DNNs, collects the results in INDArray objects, and performs matrix operations on those INDArrays. The GPU memory is now stable, i.e. it’s not increasing without bound as it was before. However, the process memory keeps increasing steadily even though all my object references are reused in the main loop, i.e. I’m not accumulating results in memory. Any idea why this is happening and resulting in the OOM error detailed previously? What am I missing?

@joel-a did you also try the thread I gave you with the gc auto frequency? Do you have a configuration we could look at maybe?

@agibsonccc I looked at the thread. I thought all the GC auto frequency did was call System.gc() periodically. Does it do something else I’m not aware of? I’m already calling System.gc() as soon as I finish with the arrays on every iteration. Regarding configuration, do you mean beyond the parameters I’m passing on the java -jar command line? Please let me know how to get the configuration values you would like to see.

Thank you for your help!
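To illustrate the difference in plain Java (a sketch of the idea, not the actual ND4J implementation): an auto-GC window means the GC is requested at most once per period, rather than on every iteration, which keeps GC overhead bounded when iterations are fast while still collecting regularly:

```java
// Sketch: throttle explicit GC requests to at most one per time window,
// similar in spirit to an auto-GC frequency, instead of calling
// System.gc() unconditionally on every loop iteration.
public class ThrottledGc {
    private final long windowMs;
    private long lastGc = 0;

    public ThrottledGc(long windowMs) {
        this.windowMs = windowMs;
    }

    /** Request a GC; only actually triggers one if the window has elapsed. */
    public boolean maybeGc() {
        long now = System.currentTimeMillis();
        if (now - lastGc >= windowMs) {
            System.gc();
            lastGc = now;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        ThrottledGc gc = new ThrottledGc(5000);
        System.out.println(gc.maybeGc()); // first request fires: true
        System.out.println(gc.maybeGc()); // immediately after: throttled, false
    }
}
```

Either way, System.gc() is only a request to the JVM, so neither approach guarantees when the off-heap buffers are actually reclaimed.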

@joel-a mainly something we could run to understand things like the size of the model, or something we could profile. If there’s an actual real issue, knowing the use case would help a ton. If cuDNN is involved, for example, there are extra small components like descriptors that get created.

These kinds of issues are subtle so really just dumping anything we could run would help a lot.