Search for "Off-Heap" Memory Leak

Hi there

I seem to be observing an off-heap memory leak while using SameDiff. I’ve checked the heap memory with jvisualvm and it stays at a fairly constant level. Meanwhile, the overall memory consumption of the process keeps increasing until I finally get a memory exception.

Caused by: java.lang.OutOfMemoryError: Cannot allocate new LongPointer(2): totalBytes = 256, physicalBytes = 61440M
        at org.bytedeco.javacpp.LongPointer.<init>(LongPointer.java:88)
        at org.bytedeco.javacpp.LongPointer.<init>(LongPointer.java:53)
        at org.nd4j.linalg.jcublas.ops.executioner.CudaOpContext.setIArguments(CudaOpContext.java:68)
        at org.nd4j.autodiff.samediff.internal.InferenceSession.getAndParameterizeOp(InferenceSession.java:867)
        at org.nd4j.autodiff.samediff.internal.InferenceSession.getAndParameterizeOp(InferenceSession.java:60)
        at org.nd4j.autodiff.samediff.internal.AbstractSession.output(AbstractSession.java:385)
        at org.nd4j.autodiff.samediff.SameDiff.directExecHelper(SameDiff.java:2579)
        at org.nd4j.autodiff.samediff.SameDiff.evaluateHelper(SameDiff.java:2163)
        at org.nd4j.autodiff.samediff.SameDiff.evaluate(SameDiff.java:2073)
        at org.nd4j.autodiff.samediff.config.EvaluationConfig.exec(EvaluationConfig.java:197)
        at org.nd4j.autodiff.samediff.SameDiff.evaluate(SameDiff.java:2045)
        at ***.learn.shared.nn.evaluation.listener.samediff.SameDiffEvaluationListener.epochEnd(SameDiffEvaluationListener.java:62)
        at org.nd4j.autodiff.samediff.SameDiff.fitHelper(SameDiff.java:1757)
        at org.nd4j.autodiff.samediff.SameDiff.fit(SameDiff.java:1569)
        at org.nd4j.autodiff.samediff.SameDiff.fit(SameDiff.java:1509)
        at org.nd4j.autodiff.samediff.config.FitConfig.exec(FitConfig.java:172)
        at org.nd4j.autodiff.samediff.SameDiff.fit(SameDiff.java:1524)
        at ***
        at ***
        at ***
        at ***
        at ***
        at java.base/java.util.ArrayList$Itr.forEachRemaining(ArrayList.java:1033)
        at java.base/java.util.Collections$UnmodifiableCollection$1.forEachRemaining(Collections.java:1054)
        at ***
        ... 14 common frames omitted
Caused by: java.lang.OutOfMemoryError: Physical memory usage is too high: physicalBytes (61440M) > maxPhysicalBytes (61440M)
        at org.bytedeco.javacpp.Pointer.deallocator(Pointer.java:700)
        at org.bytedeco.javacpp.Pointer.init(Pointer.java:126)
        at org.bytedeco.javacpp.LongPointer.allocateArray(Native Method)
        at org.bytedeco.javacpp.LongPointer.<init>(LongPointer.java:80)
        ... 38 common frames omitted

The question of interest is more how you normally proceed with such problems. I imagine you debug memory leaks like this quite regularly, but since it is proprietary code, I can’t share it very easily.

Additionally, the error actually suggests an OOM on the heap, which initially led me in a (likely) wrong direction. Is this correct, or have I misinterpreted the exception?

Thank you for any insights!

Best,
Nino

The first thing that we would typically do here is to check if it is an actual memory leak, or a race condition with the garbage collector.

Off-heap memory is only freed when the referencing objects are garbage collected. You probably have lots of RAM, so the heap itself is likely very large and mostly empty, which results in very little garbage collection pressure.

If I recall correctly, we also don’t force a garbage collection at regular time intervals anymore.

To see if that is actually the source of your problems, putting a System.gc() into your Evaluation listener should help.
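
Something like the following could work as a quick check (a minimal sketch, assuming you can call a static helper from wherever your SameDiffEvaluationListener.epochEnd does its work; the class and method names here are made up):

    import org.bytedeco.javacpp.Pointer;

    public class OffHeapGcCheck {

        // Hypothetical helper: call this from your evaluation listener's epoch-end hook.
        public static void forceGcAndLog() {
            long before = Pointer.physicalBytes();    // physical bytes as tracked by JavaCPP
            System.gc();                              // collect unreachable Pointer objects
            try {
                Thread.sleep(100);                    // give the deallocator threads a moment to run
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            long after = Pointer.physicalBytes();
            System.out.printf("off-heap: %,d -> %,d bytes (max %,d)%n",
                    before, after, Pointer.maxPhysicalBytes());
        }
    }

If the "after" value drops noticeably on each call, the off-heap memory is reclaimable and you are mostly fighting GC timing rather than a real leak.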

If it does help, you can configure your GC to run more frequently, or try to figure out where you might want to use workspaces so it doesn’t need to run.
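
For the workspace route, a minimal sketch of wrapping the per-iteration work could look like the following (the configuration values and the workspace id are placeholders, not a recommendation):

    import org.nd4j.linalg.api.memory.MemoryWorkspace;
    import org.nd4j.linalg.api.memory.conf.WorkspaceConfiguration;
    import org.nd4j.linalg.api.memory.enums.AllocationPolicy;
    import org.nd4j.linalg.api.memory.enums.LearningPolicy;
    import org.nd4j.linalg.factory.Nd4j;

    public class WorkspaceSketch {

        private static final WorkspaceConfiguration WS_CONF = WorkspaceConfiguration.builder()
                .initialSize(10 * 1024L * 1024L)           // start with 10 MB, grow as needed
                .policyAllocation(AllocationPolicy.STRICT) // allocate inside the workspace only
                .policyLearning(LearningPolicy.FIRST_LOOP) // size the workspace after the first pass
                .build();

        public static void evaluationStep() {
            try (MemoryWorkspace ws = Nd4j.getWorkspaceManager()
                    .getAndActivateWorkspace(WS_CONF, "EVAL_WS")) {
                // INDArrays created in here live in the workspace and get reused on the
                // next iteration instead of waiting for the garbage collector.
            }
        }
    }

The point of the workspace is that the off-heap buffers are cycled on every pass instead of piling up until a collection happens.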

Another thing you should be able to see from JVisualVM alone is whether you have a rising Pointer object count. That would indicate that you are holding on to those pointer objects and thereby leaking memory.

If triggering the GC doesn’t help, and you find that your pointer counts stay flat, there may be a memory leak in the native code. That is a lot more difficult to debug, as you’d have to instrument the application with valgrind to figure out where things are leaking, and when working with the JVM and late deallocation by the GC, it gets considerably harder to figure out what exactly is going on.
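
If I remember correctly, JavaCPP can also log its native allocations and deallocations, which is sometimes enough to see what keeps piling up before reaching for valgrind. A minimal sketch (treat the exact property name as an assumption on my part, and note that it has to be set before any Pointer class is loaded, so passing it as a -D flag on the command line is the safer option):

    public class JavaCppDebugLogging {
        public static void main(String[] args) {
            // Must run before any JavaCPP Pointer class is loaded;
            // -Dorg.bytedeco.javacpp.logger.debug=true on the command line is safer.
            System.setProperty("org.bytedeco.javacpp.logger.debug", "true");
            // ... start training/evaluation as usual ...
        }
    }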

Thank you very much for the quick response; I’ll follow your suggestions.

What surprises me is that I already have these three lines in my BERT model setup code:

    Nd4j.getMemoryManager().togglePeriodicGc(true);
    Nd4j.getMemoryManager().setAutoGcWindow(1000);
    Nd4j.getAffinityManager().allowCrossDeviceAccess(true);

I would have expected System.gc() to be invoked every 1000 ms during my long-running training. Am I missing something there?

In that case the GC option can already be ruled out. Then try to see whether you are holding on to pointers with the memory profiler.

Hi. Did you actually check that those calls really happen? I don’t think it is supported by SameDiff’s default cache manager (you’ll have to track down which one it is, but it extends AbstractMemoryMgr).
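
One way to check that without a profiler could be a small daemon thread that logs JavaCPP's counters once per second; if the periodic GC is really firing, physicalBytes should level off instead of climbing towards maxPhysicalBytes. A minimal sketch (class and thread names are made up):

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    import org.bytedeco.javacpp.Pointer;

    public class OffHeapMonitor {

        public static void start() {
            ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "offheap-monitor");
                t.setDaemon(true);   // don't keep the JVM alive just for the monitor
                return t;
            });
            // Log JavaCPP's allocation counters once per second.
            ses.scheduleAtFixedRate(() -> System.out.printf(
                    "javacpp totalBytes=%,d physicalBytes=%,d maxPhysicalBytes=%,d%n",
                    Pointer.totalBytes(), Pointer.physicalBytes(), Pointer.maxPhysicalBytes()),
                    0, 1, TimeUnit.SECONDS);
        }
    }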

I also had issues with memory leaks, which were quite noticeable on Linux. Running System.gc() helped only temporarily, but upgrading to M1 and using the linux-x86_64-onednn-avx2 classifier for the native backend helped a lot. On Windows I didn’t notice any huge memory leaks.