Bert: Allocation failed: [[DEVICE] allocation failed; Error code: [2]]

I’m running a BERT model imported via TFGraphMapper, in conjunction with BertIterator.
I was able to complete training on a subset of the data (5% in my case), but when I increase the data set to 10%, training still starts, but after approximately 12 hours (during the 2nd epoch) I receive the following error: Allocation failed: [[DEVICE] allocation failed; Error code: [2]]

Currently I’m using the following code to prevent OOM exceptions on the GPU:

    // Enable periodic GC so unreachable off-heap buffers get released (default: enabled)
    Nd4j.getMemoryManager().togglePeriodicGc(BooleanUtility.nvl(customConfig.getTogglePeriodicGc(), true));
    // Trigger the periodic GC at most once per window (default: every 1000 ms)
    Nd4j.getMemoryManager().setAutoGcWindow(ObjectUtility.nvl(customConfig.getAutoGcWindow(), 1000));
    // Allow arrays to be accessed across devices (default: enabled)
    Nd4j.getAffinityManager().allowCrossDeviceAccess(BooleanUtility.nvl(customConfig.getAllowCrossDeviceAccess(), true));

I also experimented with -Dorg.bytedeco.javacpp.maxbytes to cap the off-heap memory size. When I monitored the regular JVM heap memory, it seemed to stay constant and far below the maximum.
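
To check whether off-heap usage actually grows over time, JavaCPP’s built-in counters can be logged, e.g. once per epoch. This is just a minimal sketch; the OffHeapLogger helper and its tag parameter are illustrative, not part of ND4J:

    import org.bytedeco.javacpp.Pointer;

    // Illustrative helper (not part of ND4J): dumps JavaCPP's off-heap counters.
    // If totalBytes()/physicalBytes() keep climbing across epochs, that would
    // support the leak hypothesis below.
    public class OffHeapLogger {
        public static void logUsage(String tag) {
            System.out.printf("%s: totalBytes=%,d / maxBytes=%,d, physicalBytes=%,d / maxPhysicalBytes=%,d%n",
                    tag, Pointer.totalBytes(), Pointer.maxBytes(),
                    Pointer.physicalBytes(), Pointer.maxPhysicalBytes());
        }
    }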

Decreasing the batch size doesn’t help either.
My best guess is a memory leak: the available memory seems to decrease over time, as if not all memory is released between epochs and batches.
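
One thing I’m considering is forcing a cleanup between epochs, roughly like the sketch below. The epoch loop here is illustrative, not our actual training code; it assumes fitting one epoch at a time:

    import org.nd4j.linalg.factory.Nd4j;

    // Illustrative per-epoch cleanup: destroy thread-local workspaces and force
    // a GC pass so deallocators for unreachable off-heap buffers actually run.
    for (int epoch = 0; epoch < numEpochs; epoch++) {
        sameDiff.fit(bertIterator, 1); // train a single epoch
        Nd4j.getWorkspaceManager().destroyAllWorkspacesForCurrentThread();
        Nd4j.getMemoryManager().invokeGc();
    }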

The stack trace I get is:

    java.lang.RuntimeException: Allocation failed: [[DEVICE] allocation failed; Error code: [2]]
        at org.nd4j.nativeblas.OpaqueDataBuffer.allocateDataBuffer(
        at org.nd4j.linalg.jcublas.buffer.BaseCudaDataBuffer.initPointers(
        at org.nd4j.linalg.jcublas.buffer.BaseCudaDataBuffer.<init>(
        at org.nd4j.linalg.jcublas.buffer.BaseCudaDataBuffer.<init>(
        at org.nd4j.linalg.jcublas.buffer.CudaFloatDataBuffer.<init>(
        at org.nd4j.linalg.jcublas.buffer.factory.CudaDataBufferFactory.createFloat(
        at org.nd4j.linalg.factory.Nd4j.createBufferDetachedImpl(
        at org.nd4j.linalg.factory.Nd4j.createBufferDetached(
        at org.nd4j.linalg.jcublas.JCublasNDArrayFactory.createUninitializedDetached(
        at org.nd4j.linalg.factory.Nd4j.createUninitializedDetached(
        at org.nd4j.linalg.factory.Nd4j.createUninitializedDetached(
        at org.nd4j.autodiff.samediff.internal.memory.ArrayCacheMemoryMgr.allocate(
        at org.nd4j.autodiff.samediff.internal.InferenceSession.getAndParameterizeOp(
        at org.nd4j.autodiff.samediff.internal.InferenceSession.getAndParameterizeOp(
        at org.nd4j.autodiff.samediff.internal.AbstractSession.output(
        at org.nd4j.autodiff.samediff.internal.TrainingSession.trainingIteration(
        at org.nd4j.autodiff.samediff.SameDiff.fitHelper(
        at org.nd4j.autodiff.samediff.config.FitConfig.exec(

Any help would be much appreciated!

Bert can quickly require a lot of RAM. Are you using a GPU? How much memory do you have?

Ah, I see that it happens in the 2nd epoch, so per-batch memory isn’t your actual problem.

Are you doing anything special with the BertIterator? Do you see growing memory usage even with your small subset of data?

Yes, we are using a Tesla T4 with 16 GB, and we have 394 GB of RAM.

The growing memory usage is an issue regardless of data-set size, meaning we have to over-provision our GPU to ensure we can complete the training.