Hello, I have a question about the random seed. I am new to GPU training, so this will not be my last question :). If I set .seed(123) in FineTuneConfiguration and train the model on CPU, it works as it should: every new run with everything unchanged gives the same results. But on GPU the results are different every time (with the same seed). I also tried the Nd4j.getRandom().setSeed() method, but it didn’t help. I have the same code on both machines. Do you know what could be wrong? Am I missing something?
There are multiple reasons why that might be happening. To figure out what the actual reason is, we will need your help.
Are you using a convolutional network, and are you using cuDNN? It has some numerical optimizations that aren’t deterministic and need additional configuration to become deterministic again.
If not, you might have found some kind of edge case. But we have a few tests for reproducibility, so we are moderately confident that it should work as expected.
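As general background (not a statement about cuDNN's internals): one well-known reason GPU results can differ between runs is that floating-point addition is not associative, so a parallel reduction that adds the same values in a different order can round differently. A minimal plain-Java sketch of that effect follows; the class and method names are made up for illustration.

```java
// Illustration only: shows that the order of float additions changes the result.
class SummationOrderDemo {

    // Adds a and b first, then c.
    static float leftToRight(float a, float b, float c) {
        return (a + b) + c;
    }

    // Adds b and c first, then a.
    static float rightToLeft(float a, float b, float c) {
        return a + (b + c);
    }

    public static void main(String[] args) {
        float a = 1e8f, b = -1e8f, c = 1f;
        // a and b cancel exactly, so the 1 survives.
        System.out.println(leftToRight(a, b, c)); // prints 1.0
        // b + c rounds back to -1e8f (float spacing near 1e8 is 8), so the 1 is lost.
        System.out.println(rightToLeft(a, b, c)); // prints 0.0
    }
}
```

If a kernel accumulates partial sums in whatever order its threads happen to finish, small differences like this can appear even when the seed is fixed.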
Thanks for your fast response.
I’m using a convolutional network (TinyYOLO) and I’m using cuDNN. Can you help me with that configuration?
The easiest way to get that behavior is to just not use cuDNN. But I guess that you don’t want to give up all the performance benefits it has.
Are you using tinyYolo from the model zoo?
Yes, from the zoo. Any ideas? I don’t want to get rid of cuDNN.
If I’m not mistaken, there is currently no easy way to do that.
The non-easy way is to get all layers in the model, and for each convolutional layer to set these options to the following settings:
There is already an issue to track the development of creating an easier way of doing this:
So I created a new model, in this case a MultiLayerNetwork, copied all parameters from the pretrained model, and set the cuDNN configuration, but it still doesn’t work the way I want. The results are different on every run.
Any other ideas?
Let’s get a sanity check. Remove cuDNN and see if things work as expected.
Sanity check done.
Without cuDNN (I got “INFO: cuDNN not found: use cuDNN for better GPU performance by including the deeplearning4j-cuda module.” at startup), the results are different on every run.
I also tried the CPU backend on the same machine, which outputs the same result on every run, as expected.
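For anyone repeating this kind of sanity check, a rough sketch of how two runs can be compared programmatically instead of by eye. Everything here is hypothetical: `fakeTrainingRun` is a stand-in that only draws numbers from `java.util.Random`, not real DL4J training; in practice one would presumably compare the network's parameters or outputs after each run.

```java
import java.util.Arrays;
import java.util.Random;

// Illustration only: a seeded stand-in "training run" and an exact comparison of two runs.
class ReproducibilityCheck {

    // Hypothetical stand-in for training: the returned "parameters" depend only on the seed.
    static double[] fakeTrainingRun(long seed) {
        Random rng = new Random(seed);
        double[] params = new double[8];
        for (int i = 0; i < params.length; i++) {
            params[i] = rng.nextGaussian();
        }
        return params;
    }

    // True only if both arrays have the same length and element-wise equal values.
    static boolean identical(double[] first, double[] second) {
        return Arrays.equals(first, second);
    }

    public static void main(String[] args) {
        double[] runA = fakeTrainingRun(123L);
        double[] runB = fakeTrainingRun(123L);
        System.out.println("same seed, identical: " + identical(runA, runB));
    }
}
```

An exact comparison is deliberate here: a tolerance-based check would hide exactly the small run-to-run differences this thread is about.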
That is unexpected. Can you share your code with us so we can debug what is going on there?
Also, what are the specs of that machine?
OS: “Ubuntu 18.04.4 LTS”
uname -r: 4.15.0-88-generic
lspci | grep Tesla
65:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] (rev a1)
CPU: Intel® Xeon® Silver 4214 CPU @ 2.20GHz
Thanks for your time
I have reproduced the problematic behavior. I had to make a few small changes to your code, as the project you’ve provided isn’t quite runnable on its own.
It looks like a legitimate bug.
I’ve opened an issue to track the progress: https://github.com/eclipse/deeplearning4j/issues/8732
Thanks again for your help. I’ll watch the issue.