I found a strange thing. When training TinyYOLO with pretrained weights, if all data (all batches) has the same dimensions, everything is fine. But if each batch has different data dimensions (the same dimensions within a batch, but different from other batches), the training score goes to NaN at the 5th iteration, regardless of batch size.
I tested it on the CPU backend → works well, no NaN. CUDA backend (without cuDNN) → works well too. But CUDA with cuDNN does not work.
Before you start the 5th iteration, can you set Nd4j.getExecutioner().setProfilingMode(OpExecutioner.ProfilingMode.NAN_PANIC)? It will likely be very slow afterwards, because it checks for NaNs after each operation, but that way we can see where they appear.
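In case it helps, here is a minimal sketch of how that could be wired into a training loop (the class name, model, and trainIter are placeholders, not taken from your code):

    import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
    import org.nd4j.linalg.api.ops.executioner.OpExecutioner;
    import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;
    import org.nd4j.linalg.factory.Nd4j;

    public class NanPanicTraining {
        // "model" and "trainIter" stand in for your own network and data iterator
        public static void fitWithNanPanic(MultiLayerNetwork model, DataSetIterator trainIter) {
            int iteration = 0;
            while (trainIter.hasNext()) {
                if (iteration == 4) {
                    // just before the 5th iteration: from here on every op output is checked for NaNs
                    Nd4j.getExecutioner().setProfilingMode(OpExecutioner.ProfilingMode.NAN_PANIC);
                }
                model.fit(trainIter.next());
                iteration++;
            }
        }
    }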
I got this exception even at iteration 0:
Exception in thread "main" org.nd4j.linalg.exception.ND4JOpProfilerException: P.A.N.I.C.! Op.Z() contains 24502 NaN value(s):
at org.nd4j.linalg.api.ops.executioner.OpExecutionerUtil.checkForNaN(OpExecutionerUtil.java:61)
at org.nd4j.linalg.api.ops.executioner.OpExecutionerUtil.checkForNaN(OpExecutionerUtil.java:133)
at org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner.profilingConfigurableHookOut(DefaultOpExecutioner.java:556)
at org.nd4j.linalg.jcublas.ops.executioner.CudaExecutioner.exec(CudaExecutioner.java:2227)
at org.nd4j.linalg.jcublas.ops.executioner.CudaExecutioner.exec(CudaExecutioner.java:2015)
at org.nd4j.linalg.factory.Nd4j.exec(Nd4j.java:6521)
at org.nd4j.linalg.api.ndarray.BaseNDArray.divi(BaseNDArray.java:3190)
at org.nd4j.linalg.api.ndarray.BaseNDArray.div(BaseNDArray.java:3007)
at org.deeplearning4j.nn.layers.objdetect.Yolo2OutputLayer.calculateIOULabelPredicted(Yolo2OutputLayer.java:496)
at org.deeplearning4j.nn.layers.objdetect.Yolo2OutputLayer.computeBackpropGradientAndScore(Yolo2OutputLayer.java:184)
at org.deeplearning4j.nn.layers.objdetect.Yolo2OutputLayer.backpropGradient(Yolo2OutputLayer.java:103)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.calcBackpropGradients(MultiLayerNetwork.java:1944)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.computeGradientAndScore(MultiLayerNetwork.java:2765)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.computeGradientAndScore(MultiLayerNetwork.java:2708)
at org.deeplearning4j.optimize.solvers.BaseOptimizer.gradientAndScore(BaseOptimizer.java:170)
at org.deeplearning4j.optimize.solvers.StochasticGradientDescent.optimize(StochasticGradientDescent.java:63)
at org.deeplearning4j.optimize.Solver.optimize(Solver.java:52)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.fitHelper(MultiLayerNetwork.java:1713)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.fit(MultiLayerNetwork.java:1634)
at eu.extech.deep_learning.trainers.MultiLayerNetworkTrainer.fit(MultiLayerNetworkTrainer.java:24)
at eu.extech.deep_learning.trainers.NeuralNetworkTrainer.run(NeuralNetworkTrainer.java:61)
at eu.extech.deep_learning.computer_vision.ProductDetectionTrainingTest.main(ProductDetectionTrainingTest.java:105)
Is this what you want?
I also tried setProfilingMode just before the 5th iteration, and it gave me a slightly different exception with many more NaN values. Do you want that exception too?
I’ve finally gotten around to checking this, and it is because the Yolo output layer actually uses NaNs as part of the computation. That makes the search for where it later turns into NaN quite a bit harder.
Using just CUDA without cuDNN, I couldn’t run the example long enough to get NaNs, as I ran out of GPU memory.
@AlexBlack @raver119 any ideas why that might be happening? Initially I thought it might be some kind of bug related to gradient renormalization with cuDNN, but there are no huge gradients in the first few iterations.
I encountered the same problem with YOLOv2: I either get NaN or get the exception below (by setting ProfilingMode.NAN_PANIC).
I tried to remove cuDNN by renaming /usr/include/cudnn.h (there is no other cudnn.h, and no cuDNN jars in PATH or LD_LIBRARY_PATH), and removed cuda-platform-redist from pom.xml.
The exception is:
Exception in thread "main" org.nd4j.linalg.exception.ND4JOpProfilerException: P.A.N.I.C.! Op.Z() contains 7045 NaN value(s):
at org.nd4j.linalg.api.ops.executioner.OpExecutionerUtil.checkForNaN(OpExecutionerUtil.java:68)
at org.nd4j.linalg.api.ops.executioner.OpExecutionerUtil.checkForNaN(OpExecutionerUtil.java:140)
at org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner.profilingConfigurableHookOut(DefaultOpExecutioner.java:557)
at org.nd4j.linalg.jcublas.ops.executioner.CudaExecutioner.exec(CudaExecutioner.java:2512)
at org.nd4j.linalg.jcublas.ops.executioner.CudaExecutioner.exec(CudaExecutioner.java:2306)
at org.nd4j.linalg.factory.Nd4j.exec(Nd4j.java:6599)
at org.nd4j.linalg.api.ndarray.BaseNDArray.divi(BaseNDArray.java:3191)
at org.nd4j.linalg.api.ndarray.BaseNDArray.div(BaseNDArray.java:3008)
at org.deeplearning4j.nn.layers.objdetect.Yolo2OutputLayer.calculateIOULabelPredicted(Yolo2OutputLayer.java:496)
at org.deeplearning4j.nn.layers.objdetect.Yolo2OutputLayer.computeBackpropGradientAndScore(Yolo2OutputLayer.java:184)
at org.deeplearning4j.nn.layers.objdetect.Yolo2OutputLayer.backpropGradient(Yolo2OutputLayer.java:103)
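Side note on the pom.xml part above: if I remember correctly, cuDNN support in DL4J comes from the separate deeplearning4j-cuda-x.x module, so removing that dependency (rather than renaming headers) should be enough to fall back to plain CUDA. A rough sketch of what the dependencies could look like (artifact names, the CUDA version, and the version property are from memory / placeholders, so adjust them to your setup):

    <!-- plain CUDA backend for ND4J, kept -->
    <dependency>
        <groupId>org.nd4j</groupId>
        <artifactId>nd4j-cuda-10.0-platform</artifactId>
        <version>${dl4j.version}</version>
    </dependency>

    <!-- cuDNN helper module; removing or commenting this out disables cuDNN -->
    <!--
    <dependency>
        <groupId>org.deeplearning4j</groupId>
        <artifactId>deeplearning4j-cuda-10.0</artifactId>
        <version>${dl4j.version}</version>
    </dependency>
    -->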
Hi, I still haven’t solved it. I think these NaNs are part of the computation, as @treo said. The issue may be related to batch normalization. For now, when training TinyYOLO, I sometimes get a NaN score, but in the next epoch I get a proper number again, and so on. What helped a little was increasing the batch size from 24 to 32 and using the SNAPSHOT version. If you find out something, let me know.
It might not be related, but I once encountered NaNs on the GPU much sooner than on the CPU. I think it was due to differences in the algorithms/optimizations used. It turned out that my data was simply not normalized properly. I’d double-check the input data just in case.
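A rough sketch of what such a check could look like, assuming image input scaled into 0..1 with ImagePreProcessingScaler (the class name and iterator are placeholders; the BooleanIndexing/Conditions usage is from memory, so treat it as a starting point rather than the exact API):

    import org.nd4j.linalg.dataset.DataSet;
    import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;
    import org.nd4j.linalg.dataset.api.preprocessor.ImagePreProcessingScaler;
    import org.nd4j.linalg.indexing.BooleanIndexing;
    import org.nd4j.linalg.indexing.conditions.Conditions;

    public class DataSanityCheck {
        public static void check(DataSetIterator trainIter) {
            // scale raw pixel values (0..255) into 0..1 before they reach the network
            trainIter.setPreProcessor(new ImagePreProcessingScaler(0, 1));

            while (trainIter.hasNext()) {
                DataSet ds = trainIter.next();
                // true if any element of the features/labels is NaN
                boolean badFeatures = BooleanIndexing.or(ds.getFeatures(), Conditions.isNan());
                boolean badLabels = BooleanIndexing.or(ds.getLabels(), Conditions.isNan());
                if (badFeatures || badLabels) {
                    throw new IllegalStateException("NaN found in the input data before training");
                }
            }
            trainIter.reset();
        }
    }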
Hi Alex, I have checked out the snapshot version of dl4j, but now my IDE wants to download about 1 GB all day. Can I pin the SNAPSHOT to a specific day in pom.xml?