Here’s a simple program that calculates the sum of x and y. On Windows with dl4j beta6 it predicts the sum correctly (x + y = [[0.5550]]), but on an ARM server with dl4j 1.0.0-SNAPSHOT it predicts NaN.
Warning: Versions of org.bytedeco:javacpp:1.5.2 and org.bytedeco:openblas:0.3.9-1.5.3-SNAPSHOT do not match.
[WARNING]
org.nd4j.linalg.exception.ND4JOpProfilerException: P.A.N.I.C.! Op.Z() contains 30 NaN value(s):
at org.nd4j.linalg.api.ops.executioner.OpExecutionerUtil.checkForNaN (OpExecutionerUtil.java:61)
at org.nd4j.linalg.api.ops.executioner.OpExecutionerUtil.checkForAny (OpExecutionerUtil.java:65)
at org.nd4j.linalg.api.blas.impl.BaseLevel3.gemm (BaseLevel3.java:77)
at org.nd4j.linalg.api.ndarray.BaseNDArray.mmuli (BaseNDArray.java:3156)
at org.deeplearning4j.nn.layers.BaseLayer.preOutputWithPreNorm (BaseLayer.java:316)
After changing the javacpp version from 1.5.2 to 1.5.3, I get a new error:
[WARNING]
org.nd4j.linalg.exception.ND4JOpProfilerException: P.A.N.I.C.! Op.Z() contains 30 NaN value(s):
at org.nd4j.linalg.api.ops.executioner.OpExecutionerUtil.checkForNaN (OpExecutionerUtil.java:61)
at org.nd4j.linalg.api.ops.executioner.OpExecutionerUtil.checkForAny (OpExecutionerUtil.java:65)
at org.nd4j.linalg.api.blas.impl.BaseLevel3.gemm (BaseLevel3.java:77)
at org.nd4j.linalg.api.ndarray.BaseNDArray.mmuli (BaseNDArray.java:3156)
at org.deeplearning4j.nn.layers.BaseLayer.preOutputWithPreNorm (BaseLayer.java:316)
at org.deeplearning4j.nn.layers.BaseLayer.preOutput (BaseLayer.java:289)
at org.deeplearning4j.nn.layers.BaseLayer.activate (BaseLayer.java:337)
at org.deeplearning4j.nn.layers.AbstractLayer.activate (AbstractLayer.java:257)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.ffToLayerActivationsInWs (MultiLayerNetwork.java:1132)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.computeGradientAndScore (MultiLayerNetwork.java:2750)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.computeGradientAndScore (MultiLayerNetwork.java:2708)
at org.deeplearning4j.optimize.solvers.BaseOptimizer.gradientAndScore (BaseOptimizer.java:170)
at org.deeplearning4j.optimize.solvers.StochasticGradientDescent.optimize (StochasticGradientDescent.java:63)
at org.deeplearning4j.optimize.Solver.optimize (Solver.java:52)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.fitHelper (MultiLayerNetwork.java:2309)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.fit (MultiLayerNetwork.java:2267)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.fit (MultiLayerNetwork.java:2330)
[ERROR] Failed to execute goal on project nd4j-tensorflow: Could not resolve dependencies for project org.nd4j:nd4j-tensorflow:jar:1.0.0-SNAPSHOT: Failure to find org.bytedeco:tensorflow:jar:linux-x86_64:1.15.2-1.5.3-20200324.074704-327 in https://oss.sonatype.org/content/repositories/snapshots was cached in the local repository, resolution will not be reattempted until the update interval of sonatype-nexus-snapshots has elapsed or updates are forced → [Help 1]
It compiles with --update-snapshots -Djavacpp.platform=linux-x86_64.
And the test program trains and predicts without NaN; it prints the result:
x + y = [[0.5548]]
It seems the NaN only occurs on the ARM server. @treo @agibsonccc
@raver119
The nd4j test is OK; it prints the test’s values correctly.
assertion = [[ 14.0000, 32.0000],
[ 32.0000, 77.0000]]
test = [[ 14.0000, 32.0000],
[ 32.0000, 77.0000]]
(I changed the code to System.out.println("test = " + test); to print it.)
But my program still throws the ND4JOpProfilerException shown above, on the same machine and in the same Java project.
Regarding this line: int nEpochs = 500;
If I set it to 50, there is no NaN, and it prints the result x + y = [[0.4639]]
So the NaN does not appear in the first epoch of training, only somewhere later in the run.
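To narrow down when the NaN first shows up, it may help to check for non-finite values after every epoch instead of only comparing 50 against 500. Below is a minimal plain-Java sketch of that idea. It does not use DL4J: the class name NanEpochProbe, the tiny linear model, the four training samples and both learning rates are assumptions made up for illustration, not part of the original program.

```java
// Plain-Java sketch (no DL4J): report the first epoch after which a parameter is NaN/Inf.
// The linear model, the data and the learning rates are illustrative assumptions.
class NanEpochProbe {

    // Hypothetical training set: inputs (x, y); the target for each row is x + y.
    static final double[][] INPUTS = {{0.1, 0.2}, {0.4, 0.3}, {0.25, 0.5}, {0.6, 0.1}};

    /** Returns the 1-based epoch after which a parameter is NaN or infinite, or -1 if none is. */
    static int firstNonFiniteEpoch(double learningRate, int nEpochs) {
        double w1 = 0.0, w2 = 0.0, b = 0.0;
        for (int epoch = 1; epoch <= nEpochs; epoch++) {
            for (double[] in : INPUTS) {
                double x = in[0], y = in[1];
                // Prediction error of the linear model w1*x + w2*y + b against x + y.
                double err = (w1 * x + w2 * y + b) - (x + y);
                // Plain per-sample SGD step on the squared error.
                w1 -= learningRate * err * x;
                w2 -= learningRate * err * y;
                b -= learningRate * err;
            }
            // Double.isFinite is false for NaN as well as for +/-Infinity.
            if (!Double.isFinite(w1) || !Double.isFinite(w2) || !Double.isFinite(b)) {
                return epoch;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // A small step size stays finite; a deliberately huge one diverges after some epochs.
        for (double lr : new double[] {0.1, 10.0}) {
            int epoch = firstNonFiniteEpoch(lr, 500);
            System.out.println("learningRate=" + lr + " -> "
                    + (epoch < 0 ? "all parameters finite" : "non-finite after epoch " + epoch));
        }
    }
}
```

In the real DL4J program the analogous check would presumably be to call fit once per epoch and test the network score with Double.isNaN after each call; that is an untested suggestion. Knowing the exact epoch would show whether the values blow up gradually (which points at the learning rate or the data) or turn into NaN abruptly (which points more at the native gemm on ARM).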