Random seed with GPU backend

Rolnire · February 23, 2020, 5:31pm

Hello, I have a question about random seeds. I am new to GPU training, so this will not be my last question :). If I set .seed(123) in FineTuneConfiguration and train the model on CPU, it works as it should: every new run with identical settings gives the same results. But on GPU the results are different every time (with the same seed). I also tried the Nd4j.getRandom().setSeed() method, but it didn’t help. I have the same code on both machines. Do you know what could be wrong? Am I missing something?

treo · February 23, 2020, 6:52pm

There are multiple reasons why that might be happening. To figure out what the actual reason is, we will need your help.

Are you using a convolutional network, and are you using cuDNN? It has some numerical optimizations that aren’t deterministic, and it needs additional configuration to become deterministic again.
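To illustrate why such optimizations can be non-deterministic: floating-point addition is not associative, so a reduction whose accumulation order changes between runs (for example, when parallel threads add partial sums in whatever order they finish) can return slightly different totals. The class below is a plain-Java sketch of that effect only. It is not DL4J or cuDNN code, and the class and method names are made up for the example.

```java
class SummationOrderDemo {

    // Adds three floats grouped as (a + b) + c.
    static float leftGrouped(float a, float b, float c) {
        return (a + b) + c;
    }

    // Adds the same three floats grouped as a + (b + c).
    static float rightGrouped(float a, float b, float c) {
        return a + (b + c);
    }

    public static void main(String[] args) {
        float a = 1e8f, b = -1e8f, c = 1f;
        // a and b cancel exactly first, so the 1 survives: result is 1.0
        System.out.println(leftGrouped(a, b, c));
        // b + c rounds back to -1e8 (float spacing there is 8), so the 1 is lost: result is 0.0
        System.out.println(rightGrouped(a, b, c));
    }
}
```

The same three numbers give two different sums depending only on grouping. In a real training run the differences are far smaller per operation, but they accumulate over many iterations.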

If not, you might have found some kind of edge case. But we have a few tests for reproducibility, so we are moderately confident that it should work as expected.

Rolnire · February 23, 2020, 6:57pm

Thanks for your fast response.
I’m using a convolutional network (TinyYOLO) and I’m using cuDNN. Can you help me with that configuration?

treo · February 23, 2020, 7:44pm

The easiest way to get that behavior is to just not use cuDNN. But I guess that you don’t want to give up all the performance benefits it has.

Are you using tinyYolo from the model zoo?

Rolnire · February 23, 2020, 7:50pm

Yes, from the zoo. Any ideas? I don’t want to get rid of cuDNN.

treo · February 23, 2020, 7:59pm

If I’m not mistaken, there is currently no easy way to do that.

The non-easy way is to get all layers in the model, and for each convolutional layer to set these options to the following settings:
cudnnAlgoMode: USER_SPECIFIED
cudnnBwdDataMode: ALGO_1
cudnnBwdFilterMode: ALGO_1
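As a rough sketch of what those three options look like when a convolutional layer is built from scratch: the builder methods below carry the same names as the options listed above. The helper name `deterministicConv`, the 3x3 kernel, and the `nIn`/`nOut` parameters are placeholders for illustration and are not taken from TinyYOLO. This is a config fragment that needs DL4J on the classpath; whether these settings alone are enough for reproducible runs is not something this snippet demonstrates.

```java
import org.deeplearning4j.nn.conf.layers.ConvolutionLayer;

class CudnnAlgoConfigSketch {

    // Builds one convolutional layer with the cuDNN algorithm choice pinned
    // by the user instead of being selected automatically.
    // Kernel size and channel counts are illustrative placeholders.
    static ConvolutionLayer deterministicConv(int nIn, int nOut) {
        return new ConvolutionLayer.Builder(3, 3)
                .nIn(nIn)
                .nOut(nOut)
                .cudnnAlgoMode(ConvolutionLayer.AlgoMode.USER_SPECIFIED)
                .cudnnBwdDataMode(ConvolutionLayer.BwdDataAlgo.ALGO_1)
                .cudnnBwdFilterMode(ConvolutionLayer.BwdFilterAlgo.ALGO_1)
                .build();
    }
}
```

For a pretrained model from the zoo, the same three values would have to be applied to each existing convolutional layer's configuration, which is the non-easy part.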

There is already an issue to track the development of creating an easier way of doing this:

github.com/eclipse/deeplearning4j

Add CuDNN deterministic mode

opened 05:04AM - 08 May 18 UTC

AlexDBlack

CuDNN implements a number of different algorithms for CNN etc, and (based on our… use of CuDNN in DL4J) automatically selects which algorithm to use. However, many of those algorithms are non-deterministic (but may be faster than the deterministic implementations) - this non-determinism can make it challenging to debug issues that require reproducibility between runs.

Here's the scores of the same net run 5 times on the same data with cudnn:
![image](https://user-images.githubusercontent.com/2360237/39738781-7a8e3cc6-52d0-11e8-9064-18fdfcaa0075.png)

Config from here (minus dropout): https://github.com/deeplearning4j/deeplearning4j/issues/5068

See also, for example:
https://github.com/pytorch/pytorch/issues/114
https://github.com/tensorflow/tensorflow/issues/12871

Rolnire · February 24, 2020, 11:28am

So I created a new model, in this case a MultiLayerNetwork, set all parameters from the pretrained model, and set the cuDNN configuration, but it is still not working as I want. The results are different on every run.

Any other ideas?

treo · February 24, 2020, 11:37am

Let’s get a sanity check. Remove cuDNN and see if things work as expected.
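A check like this can be made mechanical by running the same seeded procedure twice and comparing the outputs bit for bit. The sketch below is plain Java and does not touch DL4J: `runOnce` is a hypothetical stand-in for "build and train the network with this seed, return the per-iteration scores", and here it only draws numbers from `java.util.Random` so that the example runs on its own.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.function.LongFunction;

class ReproducibilityCheck {

    // Hypothetical stand-in for a seeded training run that returns its scores.
    static double[] runOnce(long seed) {
        Random rng = new Random(seed);
        double[] scores = new double[5];
        for (int i = 0; i < scores.length; i++) {
            scores[i] = rng.nextDouble();
        }
        return scores;
    }

    // Runs the procedure twice with the same seed and reports whether
    // both runs produced bitwise-identical results.
    static boolean isReproducible(LongFunction<double[]> run, long seed) {
        double[] first = run.apply(seed);
        double[] second = run.apply(seed);
        return Arrays.equals(first, second);
    }

    public static void main(String[] args) {
        System.out.println(isReproducible(ReproducibilityCheck::runOnce, 123L));
    }
}
```

Comparing exact bits rather than using a tolerance is deliberate: a seeded run that is truly deterministic should match exactly, and any difference, however small, points at a non-deterministic step.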

Rolnire · February 24, 2020, 12:34pm

Sanity check done.

Without cuDNN (at startup I got “INFO: cuDNN not found: use cuDNN for better GPU performance by including the deeplearning4j-cuda module.”) the results are still different on every run.
I also tried the CPU backend on the same machine, which outputs the same result on every run, as expected.

treo · February 24, 2020, 12:39pm

That is unexpected. Can you share your code with us so we can debug what is going on there?

Also, what are the specs of that machine?

Rolnire · February 24, 2020, 1:21pm

Machine specs:
OS: “Ubuntu 18.04.4 LTS”
uname -r: 4.15.0-88-generic

lspci | grep Tesla
65:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] (rev a1)

CPU: Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz

Here is my code: https://drive.google.com/file/d/1aNrrx0utvxVijdVRY4nZ0PqoR2OybzM0/view?usp=sharing

Thanks for your time

treo · February 24, 2020, 2:46pm

I have reproduced the problematic behavior. I had to make a few small changes to your code, as the project you’ve provided isn’t quite runnable on its own.

It looks like a legitimate bug.

I’ve opened an issue to track the progress: CUDA behaves non-deterministically even without cuDNN · Issue #8732 · eclipse/deeplearning4j · GitHub

Rolnire · February 24, 2020, 2:55pm

Thanks again for your help. I’ll watch the issue.
