Bert: Allocation failed: [[DEVICE] allocation failed; Error code: [2]]

Hi
I’m running a BERT model imported via TFGraphMapper in conjunction with BertIterator.
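Roughly, the setup looks like this (the model path is a placeholder, buildBertIterator() just stands in for our actual iterator construction, and the training config is set on the SameDiff instance elsewhere):

    import java.io.File;

    import org.deeplearning4j.iterator.BertIterator;
    import org.nd4j.autodiff.samediff.SameDiff;
    import org.nd4j.imports.graphmapper.tf.TFGraphMapper;

    // Import the frozen TensorFlow BERT graph into a SameDiff instance
    SameDiff sd = TFGraphMapper.importGraph(new File("/path/to/bert_frozen.pb"));

    // Placeholder for our actual BertIterator construction
    BertIterator iterator = buildBertIterator();

    // Fit directly on the iterator (BertIterator is a MultiDataSetIterator)
    sd.fit(iterator, numEpochs);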
I was able to complete training on a subset of the data (5% in my case). When I increase the data set to 10%, training still starts, but after approximately 12 hours (during the 2nd epoch) I receive the following error: Allocation failed: [[DEVICE] allocation failed; Error code: [2]]

Currently I’m using the following code to prevent OOM exceptions on the GPU:

    // Enable periodic GC so off-heap buffers released by the JVM GC are actually freed
    Nd4j.getMemoryManager().togglePeriodicGc(BooleanUtility.nvl(customConfig.getTogglePeriodicGc(), true));
    // Minimum time between those GC invocations, in milliseconds
    Nd4j.getMemoryManager().setAutoGcWindow(ObjectUtility.nvl(customConfig.getAutoGcWindow(), 1000));
    // Allow arrays to be accessed across devices
    Nd4j.getAffinityManager().allowCrossDeviceAccess(BooleanUtility.nvl(customConfig.getAllowCrossDeviceAccess(), true));

I also fiddled around with -Dorg.bytedeco.javacpp.maxbytes to set the off-heap memory size. When I looked at the regular JVM heap memory, it seemed to stay constant and far below the maximum.
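For completeness, the launch flags look roughly like this (the jar name and the sizes are just placeholders, not a recommendation):

    java -Xmx16g \
         -Dorg.bytedeco.javacpp.maxbytes=64G \
         -Dorg.bytedeco.javacpp.maxphysicalbytes=80G \
         -jar training-job.jar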

Decreasing the batch size doesn’t help either.
My best guess is a memory leak, since the available memory seems to decrease over time, as if not all memory were released between epochs or batches.
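To check this, I log the JavaCPP off-heap counters after each epoch, roughly like this (sd and iterator as in the sketch above; I fit one epoch at a time instead of a single multi-epoch call):

    import org.bytedeco.javacpp.Pointer;

    for (int epoch = 0; epoch < numEpochs; epoch++) {
        sd.fit(iterator, 1);   // one epoch per call so memory can be sampled in between
        // totalBytes(): off-heap bytes currently tracked by JavaCPP deallocators
        // physicalBytes(): physical memory used by the whole process
        System.out.println("epoch " + epoch
                + " trackedBytes=" + Pointer.totalBytes()
                + " physicalBytes=" + Pointer.physicalBytes());
    }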

The stacktrace I get is:

java.lang.RuntimeException: Allocation failed: [[DEVICE] allocation failed; Error code: [2]]
	at org.nd4j.nativeblas.OpaqueDataBuffer.allocateDataBuffer(OpaqueDataBuffer.java:76)
	at org.nd4j.linalg.jcublas.buffer.BaseCudaDataBuffer.initPointers(BaseCudaDataBuffer.java:393)
	at org.nd4j.linalg.jcublas.buffer.BaseCudaDataBuffer.<init>(BaseCudaDataBuffer.java:409)
	at org.nd4j.linalg.jcublas.buffer.BaseCudaDataBuffer.<init>(BaseCudaDataBuffer.java:470)
	at org.nd4j.linalg.jcublas.buffer.CudaFloatDataBuffer.<init>(CudaFloatDataBuffer.java:68)
	at org.nd4j.linalg.jcublas.buffer.factory.CudaDataBufferFactory.createFloat(CudaDataBufferFactory.java:345)
	at org.nd4j.linalg.factory.Nd4j.createBufferDetachedImpl(Nd4j.java:1329)
	at org.nd4j.linalg.factory.Nd4j.createBufferDetached(Nd4j.java:1319)
	at org.nd4j.linalg.jcublas.JCublasNDArrayFactory.createUninitializedDetached(JCublasNDArrayFactory.java:1543)
	at org.nd4j.linalg.factory.Nd4j.createUninitializedDetached(Nd4j.java:4428)
	at org.nd4j.linalg.factory.Nd4j.createUninitializedDetached(Nd4j.java:4435)
	at org.nd4j.autodiff.samediff.internal.memory.ArrayCacheMemoryMgr.allocate(ArrayCacheMemoryMgr.java:135)
	at org.nd4j.autodiff.samediff.internal.InferenceSession.getAndParameterizeOp(InferenceSession.java:894)
	at org.nd4j.autodiff.samediff.internal.InferenceSession.getAndParameterizeOp(InferenceSession.java:60)
	at org.nd4j.autodiff.samediff.internal.AbstractSession.output(AbstractSession.java:385)
	at org.nd4j.autodiff.samediff.internal.TrainingSession.trainingIteration(TrainingSession.java:127)
	at org.nd4j.autodiff.samediff.SameDiff.fitHelper(SameDiff.java:1713)
	at org.nd4j.autodiff.samediff.SameDiff.fit(SameDiff.java:1569)
	at org.nd4j.autodiff.samediff.SameDiff.fit(SameDiff.java:1509)
	at org.nd4j.autodiff.samediff.config.FitConfig.exec(FitConfig.java:172)
	at org.nd4j.autodiff.samediff.SameDiff.fit(SameDiff.java:1524)

Any help would be much appreciated!

Bert can quickly require a lot of RAM. Are you using a GPU? How much memory do you have?

Edit:
Ah, I see that it happens in the 2nd epoch, so the per-batch memory isn’t your actual problem.

Are you doing anything special with the BertIterator? Do you see growing memory usage even with your small subset of data?
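For reference, a plain setup that I would consider "nothing special" looks roughly like this (vocab path, sentence/label lists, sequence length and batch size are placeholders):

    import java.io.File;
    import java.nio.charset.StandardCharsets;

    import org.deeplearning4j.iterator.BertIterator;
    import org.deeplearning4j.iterator.provider.CollectionLabeledSentenceProvider;
    import org.deeplearning4j.text.tokenization.tokenizerfactory.BertWordPieceTokenizerFactory;

    BertWordPieceTokenizerFactory tokenizer = new BertWordPieceTokenizerFactory(
            new File("/path/to/vocab.txt"), true, true, StandardCharsets.UTF_8);

    BertIterator iterator = BertIterator.builder()
            .tokenizer(tokenizer)
            .vocabMap(tokenizer.getVocab())
            .lengthHandling(BertIterator.LengthHandling.FIXED_LENGTH, 128)
            .minibatchSize(32)
            .featureArrays(BertIterator.FeatureArrays.INDICES_MASK_SEGMENTID)
            .task(BertIterator.Task.SEQ_CLASSIFICATION)
            .sentenceProvider(new CollectionLabeledSentenceProvider(sentences, labels))  // placeholder data
            .build();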

Yes, we are using a Tesla T4 with 16 GB, and we have 394 GB of RAM.

The increasing memory usage is an issue regardless of data-set size, meaning we have to over-provision our GPU to ensure we can complete the training.