Hi,
We are using Deeplearning4j to make predictions with a model that was trained in Keras and imported into Deeplearning4j. The version is 1.0.0-beta6 [1]. We are using the CPU backend; our CPU supports the AVX2 and AVX512 instruction sets [2].
The code below is called by 40 threads, each of which has its own ComputationGraph (no sharing). As we make predictions, off-heap memory usage keeps increasing, and at some point an OutOfMemoryError occurs:
Physical memory usage is too high: physicalBytes (341G) > maxPhysicalBytes (340G)
I couldn’t find the reason for this. How can I find which part of the code is leaking memory? Or what should I use to prevent memory leaks?
I added a call to destroyAllWorkspacesForCurrentThread() once an 80% memory threshold is reached (sketched below). However, even though this code is called from each thread, memory stays at around 300G (see the logging code further down).
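A minimal sketch of how that cleanup is wired; the checkMemoryPressure() helper and the comparison against Pointer.maxPhysicalBytes() are assumptions for illustration:

// Hypothetical helper, called from each prediction thread before predict().
private void checkMemoryPressure() {
    long used = Pointer.physicalBytes();
    long max = Pointer.maxPhysicalBytes(); // reflects -Dorg.bytedeco.javacpp.maxphysicalbytes
    if (max > 0 && used > 0.8 * max) {
        // Drop every workspace owned by this thread; they are re-created on next use.
        Nd4j.getWorkspaceManager().destroyAllWorkspacesForCurrentThread();
    }
}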
Program parameters:
-XX:+UseG1GC -Xms16g -Xmx100g -Dorg.bytedeco.javacpp.maxbytes=240G -Dorg.bytedeco.javacpp.maxphysicalbytes=250G
Running code:
private final WorkspaceConfiguration learningConfig = WorkspaceConfiguration.builder()
        .policyAllocation(AllocationPolicy.STRICT)  // <-- this option disables overallocation behavior
        .policyLearning(LearningPolicy.FIRST_LOOP)  // <-- this option makes the workspace learn its size after the first loop
        .build();
private void predict(ComputationGraph graph, float[] input1, float[] input2) {
    // called by 16 threads.
    long start = System.currentTimeMillis();
    try (MemoryWorkspace ws = Nd4j.getWorkspaceManager().getAndActivateWorkspace(learningConfig, "WORKSPACE_ID")) {
        INDArray firstInput = Nd4j.create(input1, SHAPE);
        INDArray secondInput = Nd4j.create(input2, SHAPE);
        long startForPredictions = System.currentTimeMillis();
        INDArray result = graph.output(false, ws, firstInput, secondInput)[0];
        // process() is almost equivalent to a no-op for testing
        process(result);
        long end = System.currentTimeMillis();
        logger.info("Time took {} ms, prediction took {}", end - start, end - startForPredictions);
    }
}
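For context, each worker thread owns its own graph instance, roughly like this (a minimal sketch; the executor setup and the loadGraph()/nextInputs() helpers are assumptions for illustration):

// One ComputationGraph per worker thread; nothing is shared between threads.
ExecutorService pool = Executors.newFixedThreadPool(40);
for (int i = 0; i < 40; i++) {
    pool.submit(() -> {
        ComputationGraph graph = loadGraph();   // wraps the Keras import shown below
        while (!Thread.currentThread().isInterrupted()) {
            float[][] inputs = nextInputs();    // hypothetical input source
            predict(graph, inputs[0], inputs[1]);
        }
    });
}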
Graph import:
ComputationGraph graph = KerasModelImport.importKerasModelAndWeights(
        Paths.get(modelDirectory, "model.json").toString(),
        Paths.get(modelDirectory, "model_weights.h5").toString());
Logging code:
logger.info("Physical bytes used by deeplearning4j: {} ({}), available bytes: {}", Pointer.physicalBytes(), Pointer.formatBytes(Pointer.physicalBytes()), Pointer.availablePhysicalBytes());
GC config:
Nd4j.getMemoryManager().setAutoGcWindow(10000);
[1] 1.0.0-beta7 gave an error during predictions for the imported model, so we delayed migration.
[2] We have declared the nd4j-native avx2 and avx512 dependencies in our pom.xml.
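A sketch of those declarations (the linux-x86_64 platform in the classifier names is an assumption; adjust for your OS):

<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-native</artifactId>
    <version>1.0.0-beta6</version>
    <classifier>linux-x86_64-avx2</classifier>
</dependency>
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-native</artifactId>
    <version>1.0.0-beta6</version>
    <classifier>linux-x86_64-avx512</classifier>
</dependency>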