Recommended way to create INDArray for prediction?


We are using Deeplearning4j to make predictions with a model that was trained in Keras and imported into Deeplearning4j. The version is 1.0.0-beta6 [1]. We are using the CPU backend, and our CPU supports AVX2 and AVX512 instructions [2].

Our code looks similar to this:

private static final int BATCH_SIZE = 4096;
private static final int INPUT_SIZE = 512;
private static final int[] SHAPE = { BATCH_SIZE, 1, INPUT_SIZE };

private void predict(ComputationGraph graph, float[] input1, float[] input2) {
    // Wrap the reused float buffers in new INDArrays for every batch
    try (INDArray firstInput = Nd4j.create(input1, SHAPE);
         INDArray secondInput = Nd4j.create(input2, SHAPE)) {
        INDArray result = graph.outputSingle(firstInput, secondInput);
        // ... result is consumed here ...
    }
}

After some profiling and logging, we saw that the Nd4j.create part takes a lot longer than graph.outputSingle. Therefore, I thought we must be doing something wrong.

The float arrays (input1, input2) are re-used, meaning we allocate them once in the application's lifetime and overwrite their contents for each batch.

So, is the recommended way of feeding data to the graph for prediction to create new INDArrays every time they are needed, as we did in the code above? Or can we re-use the INDArrays, since the underlying float arrays are already re-used? What is the most efficient way to do this?

Could you please help and guide us on this issue?

[1] 1.0.0-beta7 gave an error during predictions for the imported model, so we delayed the migration.
[2] We have declared the nd4j-native avx2 and avx512 dependencies in our pom.xml.

Thanks in advance.

How exactly are you measuring the time? The first Nd4j call takes some time to initialize the whole system, so on a cold start, whatever you call first will include the initialization overhead.
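To rule out cold-start effects, one option is to run the code a few times untimed before taking any measurements. The following is only a minimal sketch in plain Java: the class name WarmupTiming, the helper methods, and the run counts are made up for illustration, and the Arrays.fill workload is a stand-in for the Nd4j.create and graph.outputSingle calls you would time in the real application.

```java
import java.util.Arrays;

// Hypothetical helper, not part of ND4J or Deeplearning4j.
class WarmupTiming {

    // Runs the task warmupRuns times without timing it, then returns
    // the duration in nanoseconds of each of the measuredRuns timed runs.
    static long[] measure(Runnable task, int warmupRuns, int measuredRuns) {
        for (int i = 0; i < warmupRuns; i++) {
            task.run(); // absorbs one-time initialization and JIT warm-up
        }
        long[] samples = new long[measuredRuns];
        for (int i = 0; i < measuredRuns; i++) {
            long start = System.nanoTime();
            task.run();
            samples[i] = System.nanoTime() - start;
        }
        return samples;
    }

    // Returns the median sample (the upper median for an even count).
    static long median(long[] samples) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        return sorted[sorted.length / 2];
    }

    public static void main(String[] args) {
        // Stand-in workload; replace with the call you want to time.
        float[] buffer = new float[4096 * 512];
        Runnable task = () -> Arrays.fill(buffer, 1.0f);

        long[] samples = measure(task, 20, 100);
        System.out.println("median ns per run: " + median(samples));
    }
}
```

Reporting the median rather than the mean keeps a single slow run (for example a GC pause) from dominating the result.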

Other than that, you will very likely benefit from using workspaces, as they allow the system to reuse the memory for both input arrays instead of constantly allocating and deallocating it.

For examples of how to use workspaces, see

You are right. After double-checking and running the app for longer, I realized that I must have been measuring the time incorrectly. Most probably, as you said, I was seeing the first-call initialization time. After the app has been running for a while, there is no noticeable overhead from the Nd4j call compared to the prediction.

Thanks a lot for helping and for pointing out workspaces. I will look into them; they look like they will help us.

When using workspaces, should I close the INDArrays I create or not? In the examples, they are not closed.

You shouldn’t need to close them.

Thank you very much, you helped a lot in this topic.