Bert: Allocation failed: [[DEVICE] allocation failed; Error code: [2]]

I’m running a BERT model imported via TFGraphMapper, in conjunction with BertIterator.
I was able to complete training on a subset of the data (5% in my case), but when I increase the data set to 10%, training still starts, but after approximately 12 hours (during the 2nd epoch) I receive the following error: Allocation failed: [[DEVICE] allocation failed; Error code: [2]]

Currently I’m using the following code to prevent OOM exceptions on the GPU:

    // Enable periodic GC so unreachable off-heap buffers get released (default: enabled)
    Nd4j.getMemoryManager().togglePeriodicGc(BooleanUtility.nvl(customConfig.getTogglePeriodicGc(), true));
    // Trigger the periodic GC at most once per window (default: every 1000 ms)
    Nd4j.getMemoryManager().setAutoGcWindow(ObjectUtility.nvl(customConfig.getAutoGcWindow(), 1000));
    // Allow arrays to be accessed across devices (default: enabled)
    Nd4j.getAffinityManager().allowCrossDeviceAccess(BooleanUtility.nvl(customConfig.getAllowCrossDeviceAccess(), true));

I also experimented with -Dorg.bytedeco.javacpp.maxbytes to cap the off-heap memory size. When I monitored the regular JVM heap memory, it seemed to stay constant and far below the maximum.
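
To check whether off-heap usage actually grows over time, JavaCPP’s built-in counters can be logged, e.g. once per epoch. This is just a minimal sketch; the OffHeapLogger helper and its tag parameter are illustrative, not part of ND4J:

    import org.bytedeco.javacpp.Pointer;

    // Illustrative helper (not part of ND4J): dumps JavaCPP's off-heap counters.
    // If totalBytes()/physicalBytes() keep climbing across epochs, that would
    // support the leak hypothesis below.
    public class OffHeapLogger {
        public static void logUsage(String tag) {
            System.out.printf("%s: totalBytes=%,d / maxBytes=%,d, physicalBytes=%,d / maxPhysicalBytes=%,d%n",
                    tag, Pointer.totalBytes(), Pointer.maxBytes(),
                    Pointer.physicalBytes(), Pointer.maxPhysicalBytes());
        }
    }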

Decreasing the batch size doesn’t help either.
My best guess is a memory leak: the available memory seems to decrease over time, as if not all memory is released between epochs and batches.
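
One thing I’m considering is forcing a cleanup between epochs, roughly like the sketch below. The epoch loop here is illustrative, not our actual training code; it assumes fitting one epoch at a time:

    import org.nd4j.linalg.factory.Nd4j;

    // Illustrative per-epoch cleanup: destroy thread-local workspaces and force
    // a GC pass so deallocators for unreachable off-heap buffers actually run.
    for (int epoch = 0; epoch < numEpochs; epoch++) {
        sameDiff.fit(bertIterator, 1); // train a single epoch
        Nd4j.getWorkspaceManager().destroyAllWorkspacesForCurrentThread();
        Nd4j.getMemoryManager().invokeGc();
    }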

The stack trace I get is:

    java.lang.RuntimeException: Allocation failed: [[DEVICE] allocation failed; Error code: [2]]
        at org.nd4j.nativeblas.OpaqueDataBuffer.allocateDataBuffer(
        at org.nd4j.linalg.jcublas.buffer.BaseCudaDataBuffer.initPointers(
        at org.nd4j.linalg.jcublas.buffer.BaseCudaDataBuffer.<init>(
        at org.nd4j.linalg.jcublas.buffer.BaseCudaDataBuffer.<init>(
        at org.nd4j.linalg.jcublas.buffer.CudaFloatDataBuffer.<init>(
        at org.nd4j.linalg.jcublas.buffer.factory.CudaDataBufferFactory.createFloat(
        at org.nd4j.linalg.factory.Nd4j.createBufferDetachedImpl(
        at org.nd4j.linalg.factory.Nd4j.createBufferDetached(
        at org.nd4j.linalg.jcublas.JCublasNDArrayFactory.createUninitializedDetached(
        at org.nd4j.linalg.factory.Nd4j.createUninitializedDetached(
        at org.nd4j.linalg.factory.Nd4j.createUninitializedDetached(
        at org.nd4j.autodiff.samediff.internal.memory.ArrayCacheMemoryMgr.allocate(
        at org.nd4j.autodiff.samediff.internal.InferenceSession.getAndParameterizeOp(
        at org.nd4j.autodiff.samediff.internal.InferenceSession.getAndParameterizeOp(
        at org.nd4j.autodiff.samediff.internal.AbstractSession.output(
        at org.nd4j.autodiff.samediff.internal.TrainingSession.trainingIteration(
        at org.nd4j.autodiff.samediff.SameDiff.fitHelper(
        at org.nd4j.autodiff.samediff.config.FitConfig.exec(

Any help would be much appreciated!

Bert can quickly require a lot of RAM. Are you using a GPU? How much memory do you have?

Ah, I see that it happens in the 2nd epoch, so per-batch memory isn’t your actual problem.

Are you doing anything special with the BertIterator? Do you see growing memory usage even with your small subset of data?

Yes, we are using a Tesla T4 with 16 GB, and we have 394 GB of RAM.

The growing memory usage is an issue regardless of data-set size, meaning we have to over-provision our GPU to ensure we can complete the training.