NaN on arm server

I get NaN on an ARM server when training / predicting using dl4j 1.0.0-SNAPSHOT compiled from GitHub - KonduitAI/deeplearning4j: Eclipse Deeplearning4j, ND4J, DataVec and more - deep learning & linear algebra for Java/Scala with GPUs + Spark.
The environment is the same as in Run on ARM cpu server and Compiling on ARM.

Here’s a simple program that calculates the sum of x and y. It predicts the sum correctly (x + y = [[0.5550]]) on Windows using dl4j beta6, but predicts NaN on the ARM server using dl4j 1.0.0-SNAPSHOT.

Adding this line

Nd4j.getExecutioner().setProfilingMode(OpExecutioner.ProfilingMode.NAN_PANIC);

I get this error information:

Warning: Versions of org.bytedeco:javacpp:1.5.2 and org.bytedeco:openblas:0.3.9-1.5.3-SNAPSHOT do not match.
[WARNING]
org.nd4j.linalg.exception.ND4JOpProfilerException: P.A.N.I.C.! Op.Z() contains 30 NaN value(s):
at org.nd4j.linalg.api.ops.executioner.OpExecutionerUtil.checkForNaN (OpExecutionerUtil.java:61)
at org.nd4j.linalg.api.ops.executioner.OpExecutionerUtil.checkForAny (OpExecutionerUtil.java:65)
at org.nd4j.linalg.api.blas.impl.BaseLevel3.gemm (BaseLevel3.java:77)
at org.nd4j.linalg.api.ndarray.BaseNDArray.mmuli (BaseNDArray.java:3156)
at org.deeplearning4j.nn.layers.BaseLayer.preOutputWithPreNorm (BaseLayer.java:316)

After changing version 1.5.2 to 1.5.3, I get a new error:

[WARNING]
org.nd4j.linalg.exception.ND4JOpProfilerException: P.A.N.I.C.! Op.Z() contains 30 NaN value(s):
at org.nd4j.linalg.api.ops.executioner.OpExecutionerUtil.checkForNaN (OpExecutionerUtil.java:61)
at org.nd4j.linalg.api.ops.executioner.OpExecutionerUtil.checkForAny (OpExecutionerUtil.java:65)
at org.nd4j.linalg.api.blas.impl.BaseLevel3.gemm (BaseLevel3.java:77)
at org.nd4j.linalg.api.ndarray.BaseNDArray.mmuli (BaseNDArray.java:3156)
at org.deeplearning4j.nn.layers.BaseLayer.preOutputWithPreNorm (BaseLayer.java:316)
at org.deeplearning4j.nn.layers.BaseLayer.preOutput (BaseLayer.java:289)
at org.deeplearning4j.nn.layers.BaseLayer.activate (BaseLayer.java:337)
at org.deeplearning4j.nn.layers.AbstractLayer.activate (AbstractLayer.java:257)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.ffToLayerActivationsInWs (MultiLayerNetwork.java:1132)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.computeGradientAndScore (MultiLayerNetwork.java:2750)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.computeGradientAndScore (MultiLayerNetwork.java:2708)
at org.deeplearning4j.optimize.solvers.BaseOptimizer.gradientAndScore (BaseOptimizer.java:170)
at org.deeplearning4j.optimize.solvers.StochasticGradientDescent.optimize (StochasticGradientDescent.java:63)
at org.deeplearning4j.optimize.Solver.optimize (Solver.java:52)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.fitHelper (MultiLayerNetwork.java:2309)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.fit (MultiLayerNetwork.java:2267)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.fit (MultiLayerNetwork.java:2330)

@liweigu does this happen with the same model on an x86 box?

@agibsonccc I guess the answer to your question is yes.

It takes time to verify that; I’ll reply after the test.
But it’s OK on Windows x86 with beta6.

When building nd4j (from GitHub - KonduitAI/deeplearning4j: Eclipse Deeplearning4j, ND4J, DataVec and more - deep learning & linear algebra for Java/Scala with GPUs + Spark) with -Djavacpp.platform=linux-x86_64 on the CPU server, I get this error:

[ERROR] Failed to execute goal on project nd4j-tensorflow: Could not resolve dependencies for project org.nd4j:nd4j-tensorflow:jar:1.0.0-SNAPSHOT: Failure to find org.bytedeco:tensorflow:jar:linux-x86_64:1.15.2-1.5.3-20200324.074704-327 in https://oss.sonatype.org/content/repositories/snapshots was cached in the local repository, resolution will not be reattempted until the update interval of sonatype-nexus-snapshots has elapsed or updates are forced → [Help 1]

This is blocking me.

We need to build with the --update-snapshots option as well, e.g. mvn --update-snapshots -Djavacpp.platform=linux-x86_64 clean ...

It compiles with --update-snapshots -Djavacpp.platform=linux-x86_64.

And the test program can train and predict with no NaN; it prints the result:
x + y = [[0.5548]]
It seems the NaN only occurs on the ARM server.
@treo @agibsonccc

Very interesting.

Can you write a simple test for matrix multiplication, or use one of our tests, in order to check what’s going on there?

I.e. this test:

    import org.junit.Test;
    import org.nd4j.linalg.api.ndarray.INDArray;
    import org.nd4j.linalg.factory.Nd4j;
    import static org.junit.Assert.assertEquals;

    @Test
    public void testMMul() {
        INDArray arr = Nd4j.create(new double[][] {{1, 2, 3}, {4, 5, 6}});

        // mmul of a 2x3 matrix with its transpose yields a 2x2 result
        INDArray assertion = Nd4j.create(new double[][] {{14, 32}, {32, 77}});

        INDArray test = arr.mmul(arr.transpose());
        assertEquals("mmul result mismatch", assertion, test);
    }
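As an extra sanity check independent of ND4J’s BLAS backend, the same product can be computed with a naive plain-Java multiply and compared against what gemm returns on the ARM box. This is just a sketch; the class and helper names here (`MMulRefCheck`, `mmulRef`) are made up for illustration:

```java
import java.util.Arrays;

public class MMulRefCheck {
    // Naive reference matrix multiply: c = a (m x k) times b (k x n)
    static double[][] mmulRef(double[][] a, double[][] b) {
        int m = a.length, k = b.length, n = b[0].length;
        double[][] c = new double[m][n];
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++)
                for (int p = 0; p < k; p++)
                    c[i][j] += a[i][p] * b[p][j];
        return c;
    }

    static double[][] transpose(double[][] a) {
        double[][] t = new double[a[0].length][a.length];
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < a[0].length; j++)
                t[j][i] = a[i][j];
        return t;
    }

    public static void main(String[] args) {
        double[][] arr = {{1, 2, 3}, {4, 5, 6}};
        double[][] result = mmulRef(arr, transpose(arr));
        // Expected: [[14.0, 32.0], [32.0, 77.0]]
        System.out.println(Arrays.deepToString(result));
    }
}
```

If the pure-Java result matches but BLAS gemm produces NaN on ARM, that would point at the native backend rather than the model code.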

@raver119
The nd4j test is OK; it prints test’s value correctly.
assertion = [[ 14.0000, 32.0000],
[ 32.0000, 77.0000]]
test = [[ 14.0000, 32.0000],
[ 32.0000, 77.0000]]
(I changed the code to System.out.println("test = " + test);)

But my program still gets the ND4JOpProfilerException shown above, on the same machine and in the same Java project.

Hmmm… are you using a cloud provider? Is there any chance we could get access to a similar ARM machine?

From Run on ARM cpu server - #6 by liweigu

Also: can you please print out your input data and labels before feeding them into the neural network?

For this line
int nEpochs = 500;
If I set it to 50, there is no NaN, and it prints the result x + y = [[0.4639]].
So the NaN doesn’t appear in the first epoch of training.
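For what it’s worth, a NaN that only shows up after many epochs is the typical signature of training divergence: values grow step by step until they overflow to Infinity, and the next arithmetic on Infinity produces NaN. A toy plain-Java illustration (this is not the actual DL4J update rule, just a minimal divergent gradient-descent loop):

```java
public class DivergenceDemo {
    public static void main(String[] args) {
        // Gradient descent on loss = w^2 with a deliberately too-large
        // learning rate: each update maps w -> -3w, so |w| grows until
        // it overflows float range; then Inf - Inf yields NaN.
        float w = 1.0f;
        float lr = 2.0f;
        int firstNaNStep = -1;
        for (int step = 1; step <= 200; step++) {
            float grad = 2.0f * w;   // d(w^2)/dw
            w = w - lr * grad;       // w -> -3w
            if (Float.isNaN(w)) {
                firstNaNStep = step;
                break;
            }
        }
        // NaN appears only after dozens of steps, never in step 1
        System.out.println("first NaN at step " + firstNaNStep);
    }
}
```

The same mechanism would explain training that is clean for 50 epochs but NaN by 500, if the ARM build accumulates slightly differently than the x86 one.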

I put the data at

Hmmm… so it sounds like a hardware-specific overflow then… We’ll definitely need to set up an ARM machine and see what’s up there.
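On the overflow hypothesis: once any intermediate value exceeds float range it becomes Infinity, and common follow-up operations turn that into NaN. A minimal illustration of the IEEE 754 mechanics involved:

```java
public class OverflowToNaN {
    public static void main(String[] args) {
        float big = Float.MAX_VALUE;
        float inf = big * 2.0f;   // overflows to Infinity
        float nan1 = inf - inf;   // Inf - Inf = NaN
        float nan2 = 0.0f * inf;  // 0 * Inf = NaN
        System.out.println(Float.isInfinite(inf)); // true
        System.out.println(Float.isNaN(nan1));     // true
        System.out.println(Float.isNaN(nan2));     // true
    }
}
```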

@liweigu can you file a GitHub issue, please? We need to get to the bottom of this.

OK.

Thank you very much. We’ll check it out.