@treo Hi, I am following this thread with interest. I have been testing a lot with different datasets, and I have observed strange learning curves and NaN values as well. It is only a feeling, but I don't think it is related to cuDNN in general. It seems to sit deeper, in the computational layers of YOLO, and it looks more like a bug in the CPU computation to me, though this is a wild guess. If I limit the batch size to two for a specific dataset, training runs fine for thousands of iterations. Even then, how well the model performs comes down to luck, so something is broken inside for sure. I am pretty convinced it is somewhere in the YOLO structure…
You can remove the snapshot repository from your pom.xml again; Maven will then use what is cached in your local Maven repository.
At the moment we’ve only been able to reproduce this with cuDNN, so we have to assume that cuDNN is the cause. Numerical overflow/underflow issues do happen with specific hardware implementations. For example, on physical ARM hardware we sometimes run into those issues as well, while on a virtualized ARM system we don’t.
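To illustrate how an overflow alone can turn into a NaN, here is a minimal sketch in plain Java. It is not taken from the DL4J YOLO implementation and makes no claim about where the problem in this thread comes from; the class `OverflowDemo` and its methods are hypothetical names used only for illustration. It shows a naive softmax overflowing in float precision, and the usual max-subtraction variant that avoids it.

```java
public class OverflowDemo {

    // Naive softmax: exp() of a large logit exceeds the float range and
    // becomes Infinity, and Infinity / Infinity evaluates to NaN.
    static float[] naiveSoftmax(float[] logits) {
        float[] out = new float[logits.length];
        float sum = 0f;
        for (int i = 0; i < logits.length; i++) {
            out[i] = (float) Math.exp(logits[i]);
            sum += out[i];
        }
        for (int i = 0; i < out.length; i++) {
            out[i] /= sum;
        }
        return out;
    }

    // Stable variant: subtracting the maximum logit first keeps every
    // exponent at or below zero, so exp() can no longer overflow.
    static float[] stableSoftmax(float[] logits) {
        float max = Float.NEGATIVE_INFINITY;
        for (float v : logits) {
            max = Math.max(max, v);
        }
        float[] out = new float[logits.length];
        float sum = 0f;
        for (int i = 0; i < logits.length; i++) {
            out[i] = (float) Math.exp(logits[i] - max);
            sum += out[i];
        }
        for (int i = 0; i < out.length; i++) {
            out[i] /= sum;
        }
        return out;
    }

    // Returns true if any element of the array is NaN.
    static boolean hasNaN(float[] values) {
        for (float v : values) {
            if (Float.isNaN(v)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        float[] logits = {100f, 0f};
        System.out.println(hasNaN(naiveSoftmax(logits)));  // prints true
        System.out.println(hasNaN(stableSoftmax(logits))); // prints false
    }
}
```

Whether a given value overflows depends on the precision and on how a particular backend orders its operations, which is one way the same network can behave differently on different hardware.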
Our YOLO implementation appears to work fine in our test cases, so we have to assume that it is working as intended on CPU. Hunting a bug that is hard to pinpoint requires a lot of resources, and ours are still limited.
If you want to help with that, we will answer any questions that come up. But unless we find a better way to pinpoint the issue, we will have to prioritize the bugs that we have already confirmed to be fixable from our side.
Thanks for the explanation and details, I really appreciate it. I will try to collect some more observations and let you know. For now I can train my models with a very limited batch size (2) for several thousand iterations, and the results are OK. However, a bad taste remains; maybe it is still a memory issue. IMHO object detection is a very important part and would help DL4J to remain important. But I know how dynamic all this is right now.
Just to make sure I got that right: are you getting NaN problems with the CPU backend and larger batches, too? Or is it only with cuDNN?
I get it with the CPU backend and larger batches.
Ah, I see. Let’s split this off into a separate thread then. There are multiple reasons why you might run into NaNs, and we’ve got to figure out whether you run into them because of a configuration issue or something else.
Can you share anything about your project that we can use to reproduce the behavior?
Hi Treo, thanks for your help so far. As always, it is not easy to share this, but I will come back as soon as possible and extract my model in an exportable format…