Debugging Memory Issues in a Java Application

Hey Guys,

So we have a simple toy application (it loads some word vectors and does some inference), and we’ve noticed that its physical memory usage keeps increasing. We believe there is a memory leak occurring, and we’re wondering how to go about debugging it. We’d like to use tools like valgrind, but there really isn’t much documentation on best practices. We suspect the problem is coming from JavaCPP, but we’d like your input on how to diagnose it.

At a high level: through JMX we see both the heap and off-heap memory stay stable, but the resident memory of the Java process keeps increasing in the output of top. We have identified that it only happens once we start inference; the other parts of the application don’t seem to cause it. As we keep running inference, the process eventually throws an OutOfMemoryError. It’s clear that this is some native memory issue.

To be specific, the JavaCPP physicalBytes value appears to keep increasing and eventually exceeds the JavaCPP maxPhysicalBytes parameter. Essentially we want to identify the source of this issue and fix it. Could it be that the deallocator is not running as often as it should? Or is it not fast enough at deallocating memory? We’re not really sure how to answer these questions, so I’m hoping you have some insights on how we can debug this issue.

Put a System.gc() after you run your inference. If the memory stays stable, then it really is the deallocator that isn’t running often enough. That problem can usually be reined in through the use of workspaces, which let your application reuse the memory.
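
A minimal way to check that, sketched out (runInference() here is just a stand-in for your own inference call):

import org.bytedeco.javacpp.Pointer;

public class GcCheck {
    public static void main(String[] args) {
        for (int i = 0; i < 10_000; i++) {
            runInference(); // stand-in for your actual inference call
            System.gc();    // request a GC so JavaCPP's deallocators get a chance to run
            if (i % 100 == 0) {
                System.out.println("physicalBytes = " + Pointer.physicalBytes()
                        + ", totalBytes = " + Pointer.totalBytes());
            }
        }
    }

    private static void runInference() {
        // your inference code goes here
    }
}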

Then you should take a look at your application and check whether you are holding on to INDArray references somewhere, for example by putting them into a collection. While you hold on to those references, their memory can’t be freed, so your memory leak is likely there.
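
The usual anti-pattern looks something like this (the history list is hypothetical, just to illustrate the problem; model and input stand in for your own objects):

import java.util.ArrayList;
import java.util.List;
import org.nd4j.linalg.api.ndarray.INDArray;

// Hypothetical cache: every INDArray stored here pins its off-heap buffer
List<INDArray> history = new ArrayList<>();

INDArray out = model.output(input); // 'model' and 'input' stand in for your own objects
history.add(out);                   // while 'out' is reachable from this list, its memory can't be freed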

If you don’t hold on to any references, then you should first try the latest snapshots and see whether it is due to a bug that has already been fixed.

If you can’t solve the problem even with all of those suggestions, we will probably need a small self-contained demo project where we can try to isolate and reproduce the behavior.

Alright, interesting: the System.gc() does help stabilize it. But I’m curious how this will scale. At the moment each inference call happens one after the other, but I’d like to know what will happen when we parallelize the inference calls (like with a Tomcat webserver).

Can you explain in a bit more detail how System.gc() will help stabilize the memory? Here is my understanding so far.

  • JavaCPP makes native allocations, so Java’s garbage collector shouldn’t be able to deallocate that memory directly. Instead, the codebase using JavaCPP must be triggering the deallocation manually in some way?

  • System.gc() doesn’t actually trigger the garbage collector, right? It simply tells the JVM that we would like the garbage collector to run, but ultimately it’s up to the JVM to make that decision. Is it possible that even if we call System.gc() after each inference call in a multithreaded setting, the JVM’s garbage collector will simply not run often or fast enough, leading to the same problem as before?

When you are not using workspaces, each INDArray is a reference to some off-heap memory. Only once that reference is garbage collected will JavaCPP actually deallocate the memory.
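
You can watch that mechanism in isolation with a small sketch like this (the sleep is only there because the deallocator runs asynchronously, so the exact timing isn’t guaranteed):

import org.bytedeco.javacpp.Pointer;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

public class DeallocatorDemo {
    public static void main(String[] args) throws InterruptedException {
        INDArray tmp = Nd4j.create(1024, 1024, 25);               // backs ~100 MB of off-heap memory
        System.out.println("before: " + Pointer.physicalBytes());
        tmp = null;                                               // drop the only reference to it
        System.gc();                                              // once it is collected, JavaCPP frees the buffer
        Thread.sleep(1000);                                       // give the deallocator thread a moment
        System.out.println("after:  " + Pointer.physicalBytes());
    }
}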

That is true, but the JVM usually starts GC pretty soon after that call.

Yes, in theory that is possible; in practice it will probably just slow down your application.

In such a setting you should probably use ParallelInference, so you get the best possible performance.
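
Setting that up looks roughly like this; the builder values here are just examples that you’d tune for your hardware:

import org.deeplearning4j.parallelism.ParallelInference;
import org.deeplearning4j.parallelism.inference.InferenceMode;
import org.nd4j.linalg.api.ndarray.INDArray;

// 'model' is your loaded MultiLayerNetwork or ComputationGraph
ParallelInference pi = new ParallelInference.Builder(model)
        .inferenceMode(InferenceMode.BATCHED) // transparently batch concurrent requests together
        .batchLimit(32)                       // upper bound on requests per batch
        .workers(2)                           // number of worker threads / model copies
        .build();

INDArray result = pi.output(input); // safe to call from many request threads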

You should now look into using Workspaces. I guess you will not be passing on the INDArray itself, but instead use it to create some kind of result, be it just a class index or a class label that you want to return. That means that you can probably use a fixed amount of memory to run your inference, and that in turn will remove the need for calling System.gc().
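
As a sketch, assuming your result boils down to a class index (classify(), model, and wsConfig stand in for your own names):

import org.nd4j.linalg.api.memory.MemoryWorkspace;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

int classify(INDArray input) {
    try (MemoryWorkspace ws = Nd4j.getWorkspaceManager().getAndActivateWorkspace(wsConfig, "INFERENCE_WS")) {
        INDArray output = model.output(input);   // allocated inside the workspace
        return Nd4j.argMax(output, 1).getInt(0); // only a plain int leaves the workspace
    }
    // on the next call the workspace reuses the same memory instead of allocating more
}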

Thanks @treo. @zamlz I think we’re almost done implementing these suggestions.

We are also running experiments with jemalloc. I will send the graph produced by jeprof. @treo, is that helpful here for diagnosis?

Looking at different flavors of Linux, the problem may be worse on RHEL 7.7 than on Ubuntu 16.04, but that needs further testing to confirm. We hadn’t seen this problem on Windows before, but no real load testing was done there, so possibly it had slipped through the cracks.

We already know what the problem is: you have lots of short-lived allocations, and they aren’t cleaned up automatically because there isn’t a lot of pressure on the GC. The tool that DL4J / ND4J provides for exactly this case is Workspaces.

This is not a memory leak, because System.gc() alleviates your problem.

@treo we are trying with workspaces now, from snapshot. So it sounds like with workspaces we should see a stable value for Pointer.physicalBytes(), even without calling System.gc() for each call to net.output().

Previously, in a test, we saw an OutOfMemoryError thrown from the Pointer class when using workspaces, but it may have been that maxbytes was set very low. Also, that was using beta6.

To be clear, this is how we are using workspaces; the following is done for each request sent to a Tomcat app.

try (MemoryWorkspace ws = Nd4j.getWorkspaceManager().getAndActivateWorkspace(initialConfig, "SOME_ID")) {
    INDArray input = dsFromString(txtToClassify);

    if (numServed % 1000 == 0) {
        logger.info("after getting input");
        logBytesInfo();
    }
    INDArray out = CNNModel.pi.output(input);
}

The full code is here: simple-tomcat-setup/CNNModel.java at master · tc64/simple-tomcat-setup · GitHub

Yes. You should see a more or less stable value.

E.g., if you run the following example, you will see that even though it creates lots and lots of large arrays, the actual memory use stays constant. If you do it outside the workspace, you will see a sawtooth pattern. Also note that it runs a lot quicker with workspaces, as it reuses memory instead of allocating new memory.

WorkspaceConfiguration learningConfig = WorkspaceConfiguration.builder()
        .policyAllocation(AllocationPolicy.STRICT) // <-- this option disables overallocation behavior
        .policyLearning(LearningPolicy.FIRST_LOOP) // <-- the workspace learns its required size on the first loop
        .build();

for (int x = 0; x < 10000000; x++) {
    try (MemoryWorkspace ws = Nd4j.getWorkspaceManager().getAndActivateWorkspace(learningConfig, "OTHER_ID")) {
        Nd4j.create(1024, 1024, 25);
    }
    if (x % 1000 == 0) {
        System.out.println(Pointer.physicalBytes());
    }
}

With Workspaces: (graph: Pointer.physicalBytes() stays roughly constant)

Without Workspaces: (graph: Pointer.physicalBytes() shows a sawtooth pattern)