TinyYOLO training goes into NaN with cuDNN

Hello again,

I found out something strange. When training TinyYOLO with pretrained weights, if all data (all batches) has the same dimensions, everything is fine. But if every batch has different data dimensions (the same dimensions within a batch, but different between batches), the training score goes to NaN on the 5th iteration, regardless of batch size.

I tested it on the CPU backend → works fine, no NaN. CUDA backend (without cuDNN) → works fine too. But CUDA with cuDNN does not work.

What do you think about it?

Are you running beta6 or the latest snapshots?

I’m running beta6, CUDA 10.2.

I see. Can you please try the latest snapshots, so that if something is off we can debug it against current master rather than the beta6 release?

Instructions for using snapshots can be found here:
https://deeplearning4j.org/docs/latest/deeplearning4j-config-snapshots

Thank you, guys. I tried with the latest snapshots and it behaves the same way: NaN at the 5th iteration.

Before you start the 5th iteration, can you set Nd4j.getExecutioner().setProfilingMode(OpExecutioner.ProfilingMode.NAN_PANIC)? It will likely be very slow afterwards, because it checks for NaN after each operation, but that way we can see where they appear.
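A minimal sketch of what that could look like, assuming a plain single pass over a DataSetIterator (the NanPanicDebug class and the net/trainIter names are just placeholders for your own code):

import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.nd4j.linalg.api.ops.executioner.OpExecutioner;
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;
import org.nd4j.linalg.factory.Nd4j;

public class NanPanicDebug {

    // Fit one pass over the data, switching on NAN_PANIC right before the 5th
    // iteration (zero-based index 4), so the slow per-op NaN checks only kick in
    // where the problem is expected to appear.
    public static void fitWithNanPanic(MultiLayerNetwork net, DataSetIterator trainIter) {
        int iteration = 0;
        while (trainIter.hasNext()) {
            if (iteration == 4) {
                Nd4j.getExecutioner().setProfilingMode(OpExecutioner.ProfilingMode.NAN_PANIC);
            }
            net.fit(trainIter.next());
            iteration++;
        }
    }
}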

I got this exception even at iteration 0:
Exception in thread "main" org.nd4j.linalg.exception.ND4JOpProfilerException: P.A.N.I.C.! Op.Z() contains 24502 NaN value(s):
at org.nd4j.linalg.api.ops.executioner.OpExecutionerUtil.checkForNaN(OpExecutionerUtil.java:61)
at org.nd4j.linalg.api.ops.executioner.OpExecutionerUtil.checkForNaN(OpExecutionerUtil.java:133)
at org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner.profilingConfigurableHookOut(DefaultOpExecutioner.java:556)
at org.nd4j.linalg.jcublas.ops.executioner.CudaExecutioner.exec(CudaExecutioner.java:2227)
at org.nd4j.linalg.jcublas.ops.executioner.CudaExecutioner.exec(CudaExecutioner.java:2015)
at org.nd4j.linalg.factory.Nd4j.exec(Nd4j.java:6521)
at org.nd4j.linalg.api.ndarray.BaseNDArray.divi(BaseNDArray.java:3190)
at org.nd4j.linalg.api.ndarray.BaseNDArray.div(BaseNDArray.java:3007)
at org.deeplearning4j.nn.layers.objdetect.Yolo2OutputLayer.calculateIOULabelPredicted(Yolo2OutputLayer.java:496)
at org.deeplearning4j.nn.layers.objdetect.Yolo2OutputLayer.computeBackpropGradientAndScore(Yolo2OutputLayer.java:184)
at org.deeplearning4j.nn.layers.objdetect.Yolo2OutputLayer.backpropGradient(Yolo2OutputLayer.java:103)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.calcBackpropGradients(MultiLayerNetwork.java:1944)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.computeGradientAndScore(MultiLayerNetwork.java:2765)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.computeGradientAndScore(MultiLayerNetwork.java:2708)
at org.deeplearning4j.optimize.solvers.BaseOptimizer.gradientAndScore(BaseOptimizer.java:170)
at org.deeplearning4j.optimize.solvers.StochasticGradientDescent.optimize(StochasticGradientDescent.java:63)
at org.deeplearning4j.optimize.Solver.optimize(Solver.java:52)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.fitHelper(MultiLayerNetwork.java:1713)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.fit(MultiLayerNetwork.java:1634)
at eu.extech.deep_learning.trainers.MultiLayerNetworkTrainer.fit(MultiLayerNetworkTrainer.java:24)
at eu.extech.deep_learning.trainers.NeuralNetworkTrainer.run(NeuralNetworkTrainer.java:61)
at eu.extech.deep_learning.computer_vision.ProductDetectionTrainingTest.main(ProductDetectionTrainingTest.java:105)

Is this what you want?

I also tried setProfilingMode just before the 5th iteration, and it gave me a slightly different exception with many more NaN values. Do you want that exception too?

That looks suspicious. Can you provide us with a minimal demo project that produces the NaNs?

The inputs should be as close to the original as possible in shape, but it would be fine for them to be random otherwise.

Here it is:

Of course this is not what I am actually using, but it behaves in a similar way. I hope the code will be easier to run than last time :slight_smile:.
Thank you for your time.

Sorry for keeping you waiting so long.

I’ve finally gotten around to checking this, and it turns out the Yolo output layer actually uses NaNs as part of the computation. This makes the search for where it later goes to NaN quite a bit harder.

It appears that in cuDNN’s batch normalization the gradients just overflow at some point:

As you can see here, they overflow in all kinds of directions, and everything below them consequently becomes quite useless.

When using the CPU backend this obviously doesn’t appear to happen:


Using just CUDA without cuDNN, I couldn’t run the example long enough to hit the NaNs, as I ran out of GPU memory.
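In case it helps anyone reproduce this, something like the following listener is one way to watch the per-parameter gradient magnitudes and see where they start to blow up. This is only a sketch: the GradientMagnitudeListener class name and the 1e4 threshold are arbitrary choices of mine, not anything from the example project.

import java.util.Map;

import org.deeplearning4j.nn.api.Model;
import org.deeplearning4j.optimize.api.BaseTrainingListener;
import org.nd4j.linalg.api.ndarray.INDArray;

public class GradientMagnitudeListener extends BaseTrainingListener {

    // Called after each backward pass; prints any parameter whose gradient has a
    // very large or NaN absolute maximum, to narrow down where things blow up.
    @Override
    public void onBackwardPass(Model model) {
        Map<String, INDArray> grads = model.gradient().gradientForVariable();
        for (Map.Entry<String, INDArray> e : grads.entrySet()) {
            double maxAbs = e.getValue().amaxNumber().doubleValue();
            if (Double.isNaN(maxAbs) || maxAbs > 1e4) {
                System.out.println("Suspicious gradient for " + e.getKey() + ": maxAbs = " + maxAbs);
            }
        }
    }
}

Attach it with net.setListeners(new GradientMagnitudeListener()) before calling fit.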

@AlexBlack @raver119 any ideas why that might be happening? Initially I thought it might be some kind of bug related to gradient renormalization with cuDNN, but there are no huge gradients on the first few iterations.
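If anyone wants to test that theory, one experiment would be to bound the gradients explicitly and see whether the cuDNN batch norm backward pass still overflows. This is only a sketch under the assumption that the zoo TinyYOLO weights are used; the model in this thread is a hand-built MultiLayerNetwork, so the same fine-tune settings would need to be applied there instead, and the 1.0 threshold is an arbitrary starting point.

import org.deeplearning4j.nn.conf.GradientNormalization;
import org.deeplearning4j.nn.graph.ComputationGraph;
import org.deeplearning4j.nn.transferlearning.FineTuneConfiguration;
import org.deeplearning4j.nn.transferlearning.TransferLearning;
import org.deeplearning4j.zoo.model.TinyYOLO;

public class ClippedTinyYolo {

    // Load the pretrained zoo TinyYOLO weights and apply element-wise gradient
    // clipping, to test whether bounding the gradients avoids the overflow.
    public static ComputationGraph buildClippedModel() throws Exception {
        ComputationGraph pretrained = (ComputationGraph) TinyYOLO.builder().build().initPretrained();

        FineTuneConfiguration fineTune = new FineTuneConfiguration.Builder()
                .gradientNormalization(GradientNormalization.ClipElementWiseAbsoluteValue)
                .gradientNormalizationThreshold(1.0)
                .build();

        return new TransferLearning.GraphBuilder(pretrained)
                .fineTuneConfiguration(fineTune)
                .build();
    }
}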

@Rolnire Have you solved this problem?

I encountered the same problem with YOLOv2: I get NaN, or the exception below (when setting ProfilingMode.NAN_PANIC).

I tried to remove cuDNN by renaming /usr/include/cudnn.h (there is no other cudnn.h, and no cuDNN jars in PATH or LD_LIBRARY_PATH), and removed cuda-platform-redist from pom.xml.

The exception is:

Exception in thread "main" org.nd4j.linalg.exception.ND4JOpProfilerException: P.A.N.I.C.! Op.Z() contains 7045 NaN value(s):
at org.nd4j.linalg.api.ops.executioner.OpExecutionerUtil.checkForNaN(OpExecutionerUtil.java:68)
at org.nd4j.linalg.api.ops.executioner.OpExecutionerUtil.checkForNaN(OpExecutionerUtil.java:140)
at org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner.profilingConfigurableHookOut(DefaultOpExecutioner.java:557)
at org.nd4j.linalg.jcublas.ops.executioner.CudaExecutioner.exec(CudaExecutioner.java:2512)
at org.nd4j.linalg.jcublas.ops.executioner.CudaExecutioner.exec(CudaExecutioner.java:2306)
at org.nd4j.linalg.factory.Nd4j.exec(Nd4j.java:6599)
at org.nd4j.linalg.api.ndarray.BaseNDArray.divi(BaseNDArray.java:3191)
at org.nd4j.linalg.api.ndarray.BaseNDArray.div(BaseNDArray.java:3008)
at org.deeplearning4j.nn.layers.objdetect.Yolo2OutputLayer.calculateIOULabelPredicted(Yolo2OutputLayer.java:496)
at org.deeplearning4j.nn.layers.objdetect.Yolo2OutputLayer.computeBackpropGradientAndScore(Yolo2OutputLayer.java:184)
at org.deeplearning4j.nn.layers.objdetect.Yolo2OutputLayer.backpropGradient(Yolo2OutputLayer.java:103)

Hi, I still haven’t. I think these NaNs are part of the computation, as @treo said. The issue may be related to batch normalization. For now, when training TinyYOLO, I sometimes get a NaN score, but the next epoch gives a proper number again, and so on. What helped me a little bit was increasing the batch size from 24 to 32 and using the SNAPSHOT version. If you find out something, let me know :slight_smile:

@liweigu
You can disable the use of cuDNN by removing the deeplearning4j-cuda-10.2 dependency (or whichever CUDA version you are using).

Well, using nd4j-native-platform (CPU) instead of CUDA works fine.

Might not be related, but I once encountered NaNs on the GPU much faster than on the CPU. I think it was due to differences in the algorithms/optimizations used. It turned out that my data was simply not normalized properly. I’d double-check the input data just in case.
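A quick sketch of how that check could look in ND4J; the class and method names here are mine, and it assumes image data that should end up in the [0, 1] range:

import org.nd4j.linalg.dataset.DataSet;
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;
import org.nd4j.linalg.dataset.api.preprocessor.ImagePreProcessingScaler;

public class InputRangeCheck {

    // Print the min/max of the first batch's features, to verify the pixel values
    // really are in the expected range (e.g. [0, 1]) before training starts.
    public static void printFeatureRange(DataSetIterator iter) {
        DataSet first = iter.next();
        System.out.println("min = " + first.getFeatures().minNumber()
                + ", max = " + first.getFeatures().maxNumber());
        iter.reset(); // start over afterwards, if the iterator supports resetting
    }

    // Scale raw 0-255 image data into [0, 1] with the standard ND4J preprocessor.
    public static void attachScaler(DataSetIterator iter) {
        iter.setPreProcessor(new ImagePreProcessingScaler(0, 1));
    }
}

By default the scaler assumes raw pixel values in [0, 255].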

Hi Alex, I have checked out the snapshot version of DL4J, but now my IDE wants to download about 1 GB every day. Can I pin the SNAPSHOT to a specific day in pom.xml?


Not really, but you can run mvn --no-snapshot-updates ... to prevent it from downloading new versions every day.