TinyYOLO training goes into NaN with cuDNN

Hello again,

I found out something strange. When training TinyYOLO with pretrained weights, if all data (all batches) has the same dimensions, everything is fine. But if every batch has different data dimensions (the same dimensions within a batch, but different between batches), the training score goes to NaN on the 5th iteration, regardless of batch size.

I tested it on the CPU backend → works fine, no NaN. CUDA backend (without cuDNN) → works fine too. But CUDA with cuDNN does not work.

What do you think about it?

Are you running beta6 or the latest snapshots?

I’m running beta6, CUDA 10.2.

I see. Can you please try the latest snapshots, so that if something is off we can debug it against current master rather than the beta6 release?

Instructions for using snapshots can be found here:
https://deeplearning4j.org/docs/latest/deeplearning4j-config-snapshots

Thank you, guys. I tried with the latest snapshots and it behaves the same way: NaN at the 5th iteration.

Before you start the 5th iteration, can you set Nd4j.getExecutioner().setProfilingMode(OpExecutioner.ProfilingMode.NAN_PANIC)? It will likely be very slow afterwards, because it checks for NaN after each operation, but that way we can see where they appear.
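A minimal sketch of what that could look like, assuming a plain single pass over a DataSetIterator (the NanPanicDebug class and the net/trainIter names are just placeholders for your own code):

import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.nd4j.linalg.api.ops.executioner.OpExecutioner;
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;
import org.nd4j.linalg.factory.Nd4j;

public class NanPanicDebug {

    // Fit one pass over the data, switching on NAN_PANIC right before the 5th
    // iteration (zero-based index 4), so the slow per-op NaN checks only kick in
    // where the problem is expected to appear.
    public static void fitWithNanPanic(MultiLayerNetwork net, DataSetIterator trainIter) {
        int iteration = 0;
        while (trainIter.hasNext()) {
            if (iteration == 4) {
                Nd4j.getExecutioner().setProfilingMode(OpExecutioner.ProfilingMode.NAN_PANIC);
            }
            net.fit(trainIter.next());
            iteration++;
        }
    }
}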

I got this exception even at iteration 0:
Exception in thread "main" org.nd4j.linalg.exception.ND4JOpProfilerException: P.A.N.I.C.! Op.Z() contains 24502 NaN value(s):
at org.nd4j.linalg.api.ops.executioner.OpExecutionerUtil.checkForNaN(OpExecutionerUtil.java:61)
at org.nd4j.linalg.api.ops.executioner.OpExecutionerUtil.checkForNaN(OpExecutionerUtil.java:133)
at org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner.profilingConfigurableHookOut(DefaultOpExecutioner.java:556)
at org.nd4j.linalg.jcublas.ops.executioner.CudaExecutioner.exec(CudaExecutioner.java:2227)
at org.nd4j.linalg.jcublas.ops.executioner.CudaExecutioner.exec(CudaExecutioner.java:2015)
at org.nd4j.linalg.factory.Nd4j.exec(Nd4j.java:6521)
at org.nd4j.linalg.api.ndarray.BaseNDArray.divi(BaseNDArray.java:3190)
at org.nd4j.linalg.api.ndarray.BaseNDArray.div(BaseNDArray.java:3007)
at org.deeplearning4j.nn.layers.objdetect.Yolo2OutputLayer.calculateIOULabelPredicted(Yolo2OutputLayer.java:496)
at org.deeplearning4j.nn.layers.objdetect.Yolo2OutputLayer.computeBackpropGradientAndScore(Yolo2OutputLayer.java:184)
at org.deeplearning4j.nn.layers.objdetect.Yolo2OutputLayer.backpropGradient(Yolo2OutputLayer.java:103)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.calcBackpropGradients(MultiLayerNetwork.java:1944)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.computeGradientAndScore(MultiLayerNetwork.java:2765)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.computeGradientAndScore(MultiLayerNetwork.java:2708)
at org.deeplearning4j.optimize.solvers.BaseOptimizer.gradientAndScore(BaseOptimizer.java:170)
at org.deeplearning4j.optimize.solvers.StochasticGradientDescent.optimize(StochasticGradientDescent.java:63)
at org.deeplearning4j.optimize.Solver.optimize(Solver.java:52)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.fitHelper(MultiLayerNetwork.java:1713)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.fit(MultiLayerNetwork.java:1634)
at eu.extech.deep_learning.trainers.MultiLayerNetworkTrainer.fit(MultiLayerNetworkTrainer.java:24)
at eu.extech.deep_learning.trainers.NeuralNetworkTrainer.run(NeuralNetworkTrainer.java:61)
at eu.extech.deep_learning.computer_vision.ProductDetectionTrainingTest.main(ProductDetectionTrainingTest.java:105)

Is this what you want?

I also tried setProfilingMode just before the 5th iteration, and it gave me a slightly different exception with many more NaN values. Do you want that exception too?

That looks suspicious. Can you provide us with a minimal demo project that produces the NaNs?

The inputs should be as close to the original as possible in shape, but it would be fine for them to be random otherwise.

Here it is:

Of course this is not what I am actually using, but it behaves in a similar way. I hope the code will be easier to run than last time :slight_smile:.
Thank you for your time.

Sorry for keeping you waiting so long.

I’ve finally gotten around to checking this, and it turns out the Yolo output layer actually uses NaNs as part of the computation. This makes the search for where it later goes to NaN quite a bit harder.

It appears that in cuDNN’s batch normalization the gradients just overflow at some point:

As you can see here, they overflow in all kinds of directions, and everything below them consequently becomes quite useless.

When using the CPU backend this obviously doesn’t appear to happen:


Using just CUDA without cuDNN, I couldn’t run the example long enough to hit the NaNs, as I ran out of GPU memory.
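In case it helps anyone reproduce this, something like the following listener is one way to watch the per-parameter gradient magnitudes and see where they start to blow up. This is only a sketch: the GradientMagnitudeListener class name and the 1e4 threshold are arbitrary choices of mine, not anything from the example project.

import java.util.Map;

import org.deeplearning4j.nn.api.Model;
import org.deeplearning4j.optimize.api.BaseTrainingListener;
import org.nd4j.linalg.api.ndarray.INDArray;

public class GradientMagnitudeListener extends BaseTrainingListener {

    // Called after each backward pass; prints any parameter whose gradient has a
    // very large or NaN absolute maximum, to narrow down where things blow up.
    @Override
    public void onBackwardPass(Model model) {
        Map<String, INDArray> grads = model.gradient().gradientForVariable();
        for (Map.Entry<String, INDArray> e : grads.entrySet()) {
            double maxAbs = e.getValue().amaxNumber().doubleValue();
            if (Double.isNaN(maxAbs) || maxAbs > 1e4) {
                System.out.println("Suspicious gradient for " + e.getKey() + ": maxAbs = " + maxAbs);
            }
        }
    }
}

Attach it with net.setListeners(new GradientMagnitudeListener()) before calling fit.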

@AlexBlack @raver119 any ideas why that might be happening? Initially I thought it might be some kind of bug related to gradient renormalization with cuDNN, but there are no huge gradients on the first few iterations.
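If anyone wants to test that theory, one experiment would be to bound the gradients explicitly and see whether the cuDNN batch norm backward pass still overflows. This is only a sketch under the assumption that the zoo TinyYOLO weights are used; the model in this thread is a hand-built MultiLayerNetwork, so the same fine-tune settings would need to be applied there instead, and the 1.0 threshold is an arbitrary starting point.

import org.deeplearning4j.nn.conf.GradientNormalization;
import org.deeplearning4j.nn.graph.ComputationGraph;
import org.deeplearning4j.nn.transferlearning.FineTuneConfiguration;
import org.deeplearning4j.nn.transferlearning.TransferLearning;
import org.deeplearning4j.zoo.model.TinyYOLO;

public class ClippedTinyYolo {

    // Load the pretrained zoo TinyYOLO weights and apply element-wise gradient
    // clipping, to test whether bounding the gradients avoids the overflow.
    public static ComputationGraph buildClippedModel() throws Exception {
        ComputationGraph pretrained = (ComputationGraph) TinyYOLO.builder().build().initPretrained();

        FineTuneConfiguration fineTune = new FineTuneConfiguration.Builder()
                .gradientNormalization(GradientNormalization.ClipElementWiseAbsoluteValue)
                .gradientNormalizationThreshold(1.0)
                .build();

        return new TransferLearning.GraphBuilder(pretrained)
                .fineTuneConfiguration(fineTune)
                .build();
    }
}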

@Rolnire Have you solved this problem?

I encountered the same problem with YOLOv2: I get NaN, or the exception below (when setting ProfilingMode.NAN_PANIC).

I tried to remove cuDNN by renaming /usr/include/cudnn.h (there is no other cudnn.h, and no cuDNN jars in PATH or LD_LIBRARY_PATH), and removed cuda-platform-redist from pom.xml.

The exception is:

Exception in thread "main" org.nd4j.linalg.exception.ND4JOpProfilerException: P.A.N.I.C.! Op.Z() contains 7045 NaN value(s):
at org.nd4j.linalg.api.ops.executioner.OpExecutionerUtil.checkForNaN(OpExecutionerUtil.java:68)
at org.nd4j.linalg.api.ops.executioner.OpExecutionerUtil.checkForNaN(OpExecutionerUtil.java:140)
at org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner.profilingConfigurableHookOut(DefaultOpExecutioner.java:557)
at org.nd4j.linalg.jcublas.ops.executioner.CudaExecutioner.exec(CudaExecutioner.java:2512)
at org.nd4j.linalg.jcublas.ops.executioner.CudaExecutioner.exec(CudaExecutioner.java:2306)
at org.nd4j.linalg.factory.Nd4j.exec(Nd4j.java:6599)
at org.nd4j.linalg.api.ndarray.BaseNDArray.divi(BaseNDArray.java:3191)
at org.nd4j.linalg.api.ndarray.BaseNDArray.div(BaseNDArray.java:3008)
at org.deeplearning4j.nn.layers.objdetect.Yolo2OutputLayer.calculateIOULabelPredicted(Yolo2OutputLayer.java:496)
at org.deeplearning4j.nn.layers.objdetect.Yolo2OutputLayer.computeBackpropGradientAndScore(Yolo2OutputLayer.java:184)
at org.deeplearning4j.nn.layers.objdetect.Yolo2OutputLayer.backpropGradient(Yolo2OutputLayer.java:103)

Hi, I still haven’t. I think these NaNs are part of the computation, as @treo said. The issue may be related to batch normalization. For now, when training TinyYOLO, I sometimes get a NaN score, but the next epoch gives a proper number again, and so on. What helped me a little bit was increasing the batch size from 24 to 32 and using the SNAPSHOT version. If you find out something, let me know :slight_smile:

@liweigu
You can disable the use of cuDNN by removing the deeplearning4j-cuda-10.2 dependency (or whichever CUDA version you are using).

Well, using nd4j-native-platform (CPU) instead of CUDA works fine.

Might not be related, but I once encountered NaNs on the GPU much faster than on the CPU. I think it was due to differences in the algorithms/optimizations used. It turned out that my data was simply not normalized properly. I’d double-check the input data just in case.
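A quick sketch of how that check could look in ND4J; the class and method names here are mine, and it assumes image data that should end up in the [0, 1] range:

import org.nd4j.linalg.dataset.DataSet;
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;
import org.nd4j.linalg.dataset.api.preprocessor.ImagePreProcessingScaler;

public class InputRangeCheck {

    // Print the min/max of the first batch's features, to verify the pixel values
    // really are in the expected range (e.g. [0, 1]) before training starts.
    public static void printFeatureRange(DataSetIterator iter) {
        DataSet first = iter.next();
        System.out.println("min = " + first.getFeatures().minNumber()
                + ", max = " + first.getFeatures().maxNumber());
        iter.reset(); // start over afterwards, if the iterator supports resetting
    }

    // Scale raw 0-255 image data into [0, 1] with the standard ND4J preprocessor.
    public static void attachScaler(DataSetIterator iter) {
        iter.setPreProcessor(new ImagePreProcessingScaler(0, 1));
    }
}

By default the scaler assumes raw pixel values in [0, 255].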

Hi Alex, I have checked out the snapshot version of DL4J, but now my IDE wants to download about 1 GB every day. Can I pin the SNAPSHOT to a specific day in pom.xml?


Not really, but you can run mvn --no-snapshot-updates ... to prevent it from downloading new versions every day.