Bert: Allocation failed: [[DEVICE] allocation failed; Error code: [2]]

Hi
I’m running a BERT model imported via TFGraphMapper in conjunction with BertIterator.
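Roughly, the setup looks like this (the model path is a placeholder, buildBertIterator() just stands in for our actual iterator construction, and the training config is set on the SameDiff instance elsewhere):

    import java.io.File;

    import org.deeplearning4j.iterator.BertIterator;
    import org.nd4j.autodiff.samediff.SameDiff;
    import org.nd4j.imports.graphmapper.tf.TFGraphMapper;

    // Import the frozen TensorFlow BERT graph into a SameDiff instance
    SameDiff sd = TFGraphMapper.importGraph(new File("/path/to/bert_frozen.pb"));

    // Placeholder for our actual BertIterator construction
    BertIterator iterator = buildBertIterator();

    // Fit directly on the iterator (BertIterator is a MultiDataSetIterator)
    sd.fit(iterator, numEpochs);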
I was able to complete training on a subset of the data (5% in my case). When I increase the data set to 10%, training still starts, but after approximately 12 hours (during the 2nd epoch) I receive the following error: Allocation failed: [[DEVICE] allocation failed; Error code: [2]]

Currently I’m using the following code to prevent OOM exceptions on the GPU:

    // Enable periodic GC so off-heap buffers released by the JVM GC are actually freed
    Nd4j.getMemoryManager().togglePeriodicGc(BooleanUtility.nvl(customConfig.getTogglePeriodicGc(), true));
    // Minimum time between those GC invocations, in milliseconds
    Nd4j.getMemoryManager().setAutoGcWindow(ObjectUtility.nvl(customConfig.getAutoGcWindow(), 1000));
    // Allow arrays to be accessed across devices
    Nd4j.getAffinityManager().allowCrossDeviceAccess(BooleanUtility.nvl(customConfig.getAllowCrossDeviceAccess(), true));

I also fiddled around with -Dorg.bytedeco.javacpp.maxbytes to set the off-heap memory size. When I looked at the regular JVM heap memory, it seemed to stay constant and far below the maximum.
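For completeness, the launch flags look roughly like this (the jar name and the sizes are just placeholders, not a recommendation):

    java -Xmx16g \
         -Dorg.bytedeco.javacpp.maxbytes=64G \
         -Dorg.bytedeco.javacpp.maxphysicalbytes=80G \
         -jar training-job.jar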

Decreasing the batch size doesn’t help either.
My best guess is a memory leak, since the available memory seems to decrease over time, as if not all memory were released between epochs or batches.
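To check this, I log the JavaCPP off-heap counters after each epoch, roughly like this (sd and iterator as in the sketch above; I fit one epoch at a time instead of a single multi-epoch call):

    import org.bytedeco.javacpp.Pointer;

    for (int epoch = 0; epoch < numEpochs; epoch++) {
        sd.fit(iterator, 1);   // one epoch per call so memory can be sampled in between
        // totalBytes(): off-heap bytes currently tracked by JavaCPP deallocators
        // physicalBytes(): physical memory used by the whole process
        System.out.println("epoch " + epoch
                + " trackedBytes=" + Pointer.totalBytes()
                + " physicalBytes=" + Pointer.physicalBytes());
    }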

The stacktrace I get is:

java.lang.RuntimeException: Allocation failed: [[DEVICE] allocation failed; Error code: [2]]
	at org.nd4j.nativeblas.OpaqueDataBuffer.allocateDataBuffer(OpaqueDataBuffer.java:76)
	at org.nd4j.linalg.jcublas.buffer.BaseCudaDataBuffer.initPointers(BaseCudaDataBuffer.java:393)
	at org.nd4j.linalg.jcublas.buffer.BaseCudaDataBuffer.<init>(BaseCudaDataBuffer.java:409)
	at org.nd4j.linalg.jcublas.buffer.BaseCudaDataBuffer.<init>(BaseCudaDataBuffer.java:470)
	at org.nd4j.linalg.jcublas.buffer.CudaFloatDataBuffer.<init>(CudaFloatDataBuffer.java:68)
	at org.nd4j.linalg.jcublas.buffer.factory.CudaDataBufferFactory.createFloat(CudaDataBufferFactory.java:345)
	at org.nd4j.linalg.factory.Nd4j.createBufferDetachedImpl(Nd4j.java:1329)
	at org.nd4j.linalg.factory.Nd4j.createBufferDetached(Nd4j.java:1319)
	at org.nd4j.linalg.jcublas.JCublasNDArrayFactory.createUninitializedDetached(JCublasNDArrayFactory.java:1543)
	at org.nd4j.linalg.factory.Nd4j.createUninitializedDetached(Nd4j.java:4428)
	at org.nd4j.linalg.factory.Nd4j.createUninitializedDetached(Nd4j.java:4435)
	at org.nd4j.autodiff.samediff.internal.memory.ArrayCacheMemoryMgr.allocate(ArrayCacheMemoryMgr.java:135)
	at org.nd4j.autodiff.samediff.internal.InferenceSession.getAndParameterizeOp(InferenceSession.java:894)
	at org.nd4j.autodiff.samediff.internal.InferenceSession.getAndParameterizeOp(InferenceSession.java:60)
	at org.nd4j.autodiff.samediff.internal.AbstractSession.output(AbstractSession.java:385)
	at org.nd4j.autodiff.samediff.internal.TrainingSession.trainingIteration(TrainingSession.java:127)
	at org.nd4j.autodiff.samediff.SameDiff.fitHelper(SameDiff.java:1713)
	at org.nd4j.autodiff.samediff.SameDiff.fit(SameDiff.java:1569)
	at org.nd4j.autodiff.samediff.SameDiff.fit(SameDiff.java:1509)
	at org.nd4j.autodiff.samediff.config.FitConfig.exec(FitConfig.java:172)
	at org.nd4j.autodiff.samediff.SameDiff.fit(SameDiff.java:1524)

Any help would be much appreciated!

Bert can quickly require a lot of RAM. Are you using a GPU? How much memory do you have?

Edit:
Ah, I see that it happens in the 2nd epoch, so the per-batch memory isn’t your actual problem.

Are you doing anything special with the BertIterator? Do you see growing memory usage even with your small subset of data?
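For reference, a plain setup that I would consider "nothing special" looks roughly like this (vocab path, sentence/label lists, sequence length and batch size are placeholders):

    import java.io.File;
    import java.nio.charset.StandardCharsets;

    import org.deeplearning4j.iterator.BertIterator;
    import org.deeplearning4j.iterator.provider.CollectionLabeledSentenceProvider;
    import org.deeplearning4j.text.tokenization.tokenizerfactory.BertWordPieceTokenizerFactory;

    BertWordPieceTokenizerFactory tokenizer = new BertWordPieceTokenizerFactory(
            new File("/path/to/vocab.txt"), true, true, StandardCharsets.UTF_8);

    BertIterator iterator = BertIterator.builder()
            .tokenizer(tokenizer)
            .vocabMap(tokenizer.getVocab())
            .lengthHandling(BertIterator.LengthHandling.FIXED_LENGTH, 128)
            .minibatchSize(32)
            .featureArrays(BertIterator.FeatureArrays.INDICES_MASK_SEGMENTID)
            .task(BertIterator.Task.SEQ_CLASSIFICATION)
            .sentenceProvider(new CollectionLabeledSentenceProvider(sentences, labels))  // placeholder data
            .build();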

Yes, we are using a Tesla T4 with 16 GB, and we have 394 GB of RAM.

The increasing memory usage is an issue regardless of data-set size, meaning we have to over-provision our GPU to ensure we can complete the training.