NaN on arm server

I get NaN on an ARM server when training / predicting using dl4j 1.0.0-SNAPSHOT compiled from GitHub - KonduitAI/deeplearning4j: Eclipse Deeplearning4j, ND4J, DataVec and more - deep learning & linear algebra for Java/Scala with GPUs + Spark.
The environment is the same as in Run on ARM cpu server and Compiling on ARM.

Here’s a simple program that calculates the sum of x and y. It predicts the sum correctly (x + y = [[0.5550]]) on Windows using dl4j beta6, but predicts NaN on the ARM server using dl4j 1.0.0-SNAPSHOT.

Adding this line

Nd4j.getExecutioner().setProfilingMode(OpExecutioner.ProfilingMode.NAN_PANIC);

I get this error information:

Warning: Versions of org.bytedeco:javacpp:1.5.2 and org.bytedeco:openblas:0.3.9-1.5.3-SNAPSHOT do not match.
[WARNING]
org.nd4j.linalg.exception.ND4JOpProfilerException: P.A.N.I.C.! Op.Z() contains 30 NaN value(s):
at org.nd4j.linalg.api.ops.executioner.OpExecutionerUtil.checkForNaN (OpExecutionerUtil.java:61)
at org.nd4j.linalg.api.ops.executioner.OpExecutionerUtil.checkForAny (OpExecutionerUtil.java:65)
at org.nd4j.linalg.api.blas.impl.BaseLevel3.gemm (BaseLevel3.java:77)
at org.nd4j.linalg.api.ndarray.BaseNDArray.mmuli (BaseNDArray.java:3156)
at org.deeplearning4j.nn.layers.BaseLayer.preOutputWithPreNorm (BaseLayer.java:316)

After changing version 1.5.2 to 1.5.3, I get a new error:

[WARNING]
org.nd4j.linalg.exception.ND4JOpProfilerException: P.A.N.I.C.! Op.Z() contains 30 NaN value(s):
at org.nd4j.linalg.api.ops.executioner.OpExecutionerUtil.checkForNaN (OpExecutionerUtil.java:61)
at org.nd4j.linalg.api.ops.executioner.OpExecutionerUtil.checkForAny (OpExecutionerUtil.java:65)
at org.nd4j.linalg.api.blas.impl.BaseLevel3.gemm (BaseLevel3.java:77)
at org.nd4j.linalg.api.ndarray.BaseNDArray.mmuli (BaseNDArray.java:3156)
at org.deeplearning4j.nn.layers.BaseLayer.preOutputWithPreNorm (BaseLayer.java:316)
at org.deeplearning4j.nn.layers.BaseLayer.preOutput (BaseLayer.java:289)
at org.deeplearning4j.nn.layers.BaseLayer.activate (BaseLayer.java:337)
at org.deeplearning4j.nn.layers.AbstractLayer.activate (AbstractLayer.java:257)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.ffToLayerActivationsInWs (MultiLayerNetwork.java:1132)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.computeGradientAndScore (MultiLayerNetwork.java:2750)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.computeGradientAndScore (MultiLayerNetwork.java:2708)
at org.deeplearning4j.optimize.solvers.BaseOptimizer.gradientAndScore (BaseOptimizer.java:170)
at org.deeplearning4j.optimize.solvers.StochasticGradientDescent.optimize (StochasticGradientDescent.java:63)
at org.deeplearning4j.optimize.Solver.optimize (Solver.java:52)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.fitHelper (MultiLayerNetwork.java:2309)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.fit (MultiLayerNetwork.java:2267)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.fit (MultiLayerNetwork.java:2330)

@liweigu does this happen with the same model on an x86 box?

@agibsonccc I guess the answer to your question is yes.

It takes time to verify that; I’ll reply after the test.
But it’s OK on Windows x86 with beta6.

When building nd4j (from GitHub - KonduitAI/deeplearning4j: Eclipse Deeplearning4j, ND4J, DataVec and more - deep learning & linear algebra for Java/Scala with GPUs + Spark) with -Djavacpp.platform=linux-x86_64 on the CPU server, I get this error:

[ERROR] Failed to execute goal on project nd4j-tensorflow: Could not resolve dependencies for project org.nd4j:nd4j-tensorflow:jar:1.0.0-SNAPSHOT: Failure to find org.bytedeco:tensorflow:jar:linux-x86_64:1.15.2-1.5.3-20200324.074704-327 in https://oss.sonatype.org/content/repositories/snapshots was cached in the local repository, resolution will not be reattempted until the update interval of sonatype-nexus-snapshots has elapsed or updates are forced → [Help 1]

This is blocking me.

We need to build with the --update-snapshots option as well, e.g. mvn --update-snapshots -Djavacpp.platform=linux-x86_64 clean ...

It compiles with --update-snapshots -Djavacpp.platform=linux-x86_64.

And the test program can train and predict with no NaN; it prints the result:
x + y = [[0.5548]]
It seems the NaN only occurs on the ARM server.
@treo @agibsonccc

Very interesting.

Can you write a simple test for matrix multiplication, or use one of our tests, in order to check what’s going on there?

I.e. this test:

    import org.junit.Test;
    import org.nd4j.linalg.api.ndarray.INDArray;
    import org.nd4j.linalg.factory.Nd4j;
    import static org.junit.Assert.assertEquals;

    @Test
    public void testMMul() {
        INDArray arr = Nd4j.create(new double[][] {{1, 2, 3}, {4, 5, 6}});

        // mmul of a 2x3 matrix with its transpose yields a 2x2 result
        INDArray assertion = Nd4j.create(new double[][] {{14, 32}, {32, 77}});

        INDArray test = arr.mmul(arr.transpose());
        assertEquals("mmul result mismatch", assertion, test);
    }
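As an extra sanity check independent of ND4J’s BLAS backend, the same product can be computed with a naive plain-Java multiply and compared against what gemm returns on the ARM box. This is just a sketch; the class and helper names here (`MMulRefCheck`, `mmulRef`) are made up for illustration:

```java
import java.util.Arrays;

public class MMulRefCheck {
    // Naive reference matrix multiply: c = a (m x k) times b (k x n)
    static double[][] mmulRef(double[][] a, double[][] b) {
        int m = a.length, k = b.length, n = b[0].length;
        double[][] c = new double[m][n];
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++)
                for (int p = 0; p < k; p++)
                    c[i][j] += a[i][p] * b[p][j];
        return c;
    }

    static double[][] transpose(double[][] a) {
        double[][] t = new double[a[0].length][a.length];
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < a[0].length; j++)
                t[j][i] = a[i][j];
        return t;
    }

    public static void main(String[] args) {
        double[][] arr = {{1, 2, 3}, {4, 5, 6}};
        double[][] result = mmulRef(arr, transpose(arr));
        // Expected: [[14.0, 32.0], [32.0, 77.0]]
        System.out.println(Arrays.deepToString(result));
    }
}
```

If the pure-Java result matches but BLAS gemm produces NaN on ARM, that would point at the native backend rather than the model code.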

@raver119
The nd4j test is OK; it prints test’s value correctly.
assertion = [[ 14.0000, 32.0000],
[ 32.0000, 77.0000]]
test = [[ 14.0000, 32.0000],
[ 32.0000, 77.0000]]
(I changed the code to System.out.println("test = " + test);)

But my program still gets the ND4JOpProfilerException shown above, on the same machine and in the same Java project.

Hmmm… are you using a cloud provider? Is there any chance we could get access to a similar ARM machine?

From Run on ARM cpu server - #6 by liweigu

Also: can you please print out your input data and labels before feeding them into the neural network?

For this line
int nEpochs = 500;
If I set it to 50, there is no NaN, and it prints the result x + y = [[0.4639]].
So the NaN doesn’t appear in the first epoch of training.
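For what it’s worth, a NaN that only shows up after many epochs is the typical signature of training divergence: values grow step by step until they overflow to Infinity, and the next arithmetic on Infinity produces NaN. A toy plain-Java illustration (this is not the actual DL4J update rule, just a minimal divergent gradient-descent loop):

```java
public class DivergenceDemo {
    public static void main(String[] args) {
        // Gradient descent on loss = w^2 with a deliberately too-large
        // learning rate: each update maps w -> -3w, so |w| grows until
        // it overflows float range; then Inf - Inf yields NaN.
        float w = 1.0f;
        float lr = 2.0f;
        int firstNaNStep = -1;
        for (int step = 1; step <= 200; step++) {
            float grad = 2.0f * w;   // d(w^2)/dw
            w = w - lr * grad;       // w -> -3w
            if (Float.isNaN(w)) {
                firstNaNStep = step;
                break;
            }
        }
        // NaN appears only after dozens of steps, never in step 1
        System.out.println("first NaN at step " + firstNaNStep);
    }
}
```

The same mechanism would explain training that is clean for 50 epochs but NaN by 500, if the ARM build accumulates slightly differently than the x86 one.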

I put the data at

Hmmm… so it sounds like a hardware-specific overflow then… We’ll definitely need to set up an ARM machine and see what’s up there.
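On the overflow hypothesis: once any intermediate value exceeds float range it becomes Infinity, and common follow-up operations turn that into NaN. A minimal illustration of the IEEE 754 mechanics involved:

```java
public class OverflowToNaN {
    public static void main(String[] args) {
        float big = Float.MAX_VALUE;
        float inf = big * 2.0f;   // overflows to Infinity
        float nan1 = inf - inf;   // Inf - Inf = NaN
        float nan2 = 0.0f * inf;  // 0 * Inf = NaN
        System.out.println(Float.isInfinite(inf)); // true
        System.out.println(Float.isNaN(nan1));     // true
        System.out.println(Float.isNaN(nan2));     // true
    }
}
```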

@liweigu can you file a GitHub issue, please? We need to get to the bottom of this.

OK.

Thank you very much. We’ll check it out.