Hi
I’m running a BERT model imported via TFGraphMapper, in conjunction with BertIterator.
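Roughly, the wiring looks like this (heavily simplified; the path, class, and method names are placeholders rather than my exact code, and the TrainingConfig setup is omitted):

    import java.io.File;

    import org.nd4j.autodiff.samediff.SameDiff;
    import org.nd4j.imports.graphmapper.tf.TFGraphMapper;
    import org.nd4j.linalg.dataset.api.iterator.MultiDataSetIterator;

    public class BertTraining {
        // Import the frozen TensorFlow BERT graph, then train on the BertIterator,
        // which implements MultiDataSetIterator and so plugs into fit() directly.
        public static void train(MultiDataSetIterator bertIterator, int numEpochs) {
            SameDiff sd = TFGraphMapper.importGraph(new File("bert_frozen.pb")); // placeholder path
            // (updater/loss TrainingConfig setup omitted for brevity)
            sd.fit(bertIterator, numEpochs);
        }
    }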
I was able to complete training on a subset of the data (5% in my case). When I increase the dataset to 10%, training still starts, but after approximately 12 hours (during the 2nd epoch) I get the following error: Allocation failed: [[DEVICE] allocation failed; Error code: [2]]
Currently I’m using the following code to prevent OOM exceptions on the GPU:
    // nvl(...) is our own helper: it returns the configured value, or the given default when null.
    // Trigger System.gc() periodically so unreferenced off-heap buffers get deallocated promptly
    Nd4j.getMemoryManager().togglePeriodicGc(BooleanUtility.nvl(customConfig.getTogglePeriodicGc(), true));
    // Force at most one GC per window, in milliseconds
    Nd4j.getMemoryManager().setAutoGcWindow(ObjectUtility.nvl(customConfig.getAutoGcWindow(), 1000));
    // Allow INDArrays to be accessed from any device
    Nd4j.getAffinityManager().allowCrossDeviceAccess(BooleanUtility.nvl(customConfig.getAllowCrossDeviceAccess(), true));
I also experimented with -Dorg.bytedeco.javacpp.maxbytes to set the off-heap memory size. The regular JVM heap usage, by contrast, appeared constant and far below its maximum.
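For what it’s worth, the effective limits can be read back at runtime through JavaCPP’s Pointer counters, which is how I checked that the flag was actually picked up:

    import org.bytedeco.javacpp.Pointer;

    // Effective off-heap limits as picked up from -Dorg.bytedeco.javacpp.maxbytes
    // and -Dorg.bytedeco.javacpp.maxphysicalbytes
    System.out.println("maxBytes         = " + Pointer.maxBytes());
    System.out.println("maxPhysicalBytes = " + Pointer.maxPhysicalBytes());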
Decreasing the batch size doesn’t help either.
My best guess is a memory leak: the available memory seems to decrease over time, as if not all memory were released between epochs, or even between batches.
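To pin this down, my next step is to log JavaCPP’s off-heap counters during training via a SameDiff listener, along these lines (a minimal sketch; the class name and the 100-iteration logging interval are my own choices):

    import org.bytedeco.javacpp.Pointer;
    import org.nd4j.autodiff.listeners.At;
    import org.nd4j.autodiff.listeners.BaseListener;
    import org.nd4j.autodiff.listeners.Loss;
    import org.nd4j.autodiff.listeners.Operation;
    import org.nd4j.autodiff.samediff.SameDiff;
    import org.nd4j.linalg.dataset.api.MultiDataSet;

    // Logs the tracked off-heap allocation counters every 100 training iterations.
    // If totalBytes/physicalBytes climb monotonically across batches and epochs,
    // that would support the leak hypothesis.
    public class OffHeapLoggingListener extends BaseListener {
        @Override
        public boolean isActive(Operation operation) {
            return operation == Operation.TRAINING;
        }

        @Override
        public void iterationDone(SameDiff sd, At at, MultiDataSet dataSet, Loss loss) {
            if (at.iteration() % 100 == 0) {
                System.out.printf("iter %d: totalBytes=%d physicalBytes=%d (maxBytes=%d)%n",
                        at.iteration(), Pointer.totalBytes(), Pointer.physicalBytes(), Pointer.maxBytes());
            }
        }
    }

I’d attach it via sd.fit(iterator, numEpochs, new OffHeapLoggingListener()).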
The stacktrace I get is:
java.lang.RuntimeException: Allocation failed: [[DEVICE] allocation failed; Error code: [2]]
	at org.nd4j.nativeblas.OpaqueDataBuffer.allocateDataBuffer(OpaqueDataBuffer.java:76)
	at org.nd4j.linalg.jcublas.buffer.BaseCudaDataBuffer.initPointers(BaseCudaDataBuffer.java:393)
	at org.nd4j.linalg.jcublas.buffer.BaseCudaDataBuffer.<init>(BaseCudaDataBuffer.java:409)
	at org.nd4j.linalg.jcublas.buffer.BaseCudaDataBuffer.<init>(BaseCudaDataBuffer.java:470)
	at org.nd4j.linalg.jcublas.buffer.CudaFloatDataBuffer.<init>(CudaFloatDataBuffer.java:68)
	at org.nd4j.linalg.jcublas.buffer.factory.CudaDataBufferFactory.createFloat(CudaDataBufferFactory.java:345)
	at org.nd4j.linalg.factory.Nd4j.createBufferDetachedImpl(Nd4j.java:1329)
	at org.nd4j.linalg.factory.Nd4j.createBufferDetached(Nd4j.java:1319)
	at org.nd4j.linalg.jcublas.JCublasNDArrayFactory.createUninitializedDetached(JCublasNDArrayFactory.java:1543)
	at org.nd4j.linalg.factory.Nd4j.createUninitializedDetached(Nd4j.java:4428)
	at org.nd4j.linalg.factory.Nd4j.createUninitializedDetached(Nd4j.java:4435)
	at org.nd4j.autodiff.samediff.internal.memory.ArrayCacheMemoryMgr.allocate(ArrayCacheMemoryMgr.java:135)
	at org.nd4j.autodiff.samediff.internal.InferenceSession.getAndParameterizeOp(InferenceSession.java:894)
	at org.nd4j.autodiff.samediff.internal.InferenceSession.getAndParameterizeOp(InferenceSession.java:60)
	at org.nd4j.autodiff.samediff.internal.AbstractSession.output(AbstractSession.java:385)
	at org.nd4j.autodiff.samediff.internal.TrainingSession.trainingIteration(TrainingSession.java:127)
	at org.nd4j.autodiff.samediff.SameDiff.fitHelper(SameDiff.java:1713)
	at org.nd4j.autodiff.samediff.SameDiff.fit(SameDiff.java:1569)
	at org.nd4j.autodiff.samediff.SameDiff.fit(SameDiff.java:1509)
	at org.nd4j.autodiff.samediff.config.FitConfig.exec(FitConfig.java:172)
	at org.nd4j.autodiff.samediff.SameDiff.fit(SameDiff.java:1524)
Any help would be much appreciated!