Performance issue

Hello everyone!

I built an NN and ran some experiments.

I ran it on my local machine (4 cores, 8 threads, 8 GB memory + SSD). Here are the performance log and results:
Time: about 50 minutes
Accuracy: 83.7%

Next I ran the same NN on a Google Cloud VM with 1 core and 6.5 GB memory:
Time: about 40 minutes
Accuracy: the same, 83.7%
(Screenshots: dark theme is the local machine, white theme is the Google Cloud VM.)
So the question is: why is there a difference in training time? Moreover, the result is better on the machine with the weaker hardware.

In addition: how can I improve performance? The same NN in TF trains in about 10 minutes, which is 4 times faster than DL4J.
Maybe I can enable more threads? I have 8 but only 4 are used, or is it something else?
I really don't understand this difference :neutral_face:

If we can’t run it ourselves, we can’t tell you why it’s faster/slower :slight_smile:
Could you give us the complete information? We’d need to be able to set up TensorFlow as well.
Imagine everything we’d need to run a benchmark ourselves: the 2 projects side by side, the versions of everything you used, which OS you used for each, and how you ran each one.

Screenshots from some logs and “tensorflow was faster and it was roughly something similar” doesn’t really help us determine much.

Let’s leave TF aside for now.

What's wrong with comparing the two runs on different machines?
What other information do you need to answer?
The NN is exactly the same.
The VM has weaker hardware but gets the better result… why?

@SaviorD7 I need literally everything you used to run it:

  1. Source code
  2. Java command you use to run it.
  3. If you have it, a similar dataset, if not the exact one

If I can’t reproduce it I can’t really help you. What you’re essentially asking me to do is download your code, run it myself, run instrumentation tools to identify the bottlenecks and compare them.

I’m not going to guess, I’m going to measure and verify. After that I can offer suggestions and see what differences might affect your code running.

If you want, you can also add in: https://github.com/eclipse/deeplearning4j/blob/master/deeplearning4j/deeplearning4j-nn/src/main/java/org/deeplearning4j/optimize/listeners/PerformanceListener.java

to measure performance and compare them as well.

As for the machines, it’d be nice to know which size of VM you had on GCP as well.

Again, I can’t answer much of “why” without running it myself.

So,

  1. Source code: https://pastebin.com/cTYEnyAs
  2. What exactly do you need? I'm using: "C:\Program Files\Java\jdk-14.0.2\bin\java.exe"
  3. Dataset: https://drive.google.com/drive/folders/14TXMmKbnF_oKBfnx4RJ1V2MDUkSs3cvk

@SaviorD7 beautiful, thanks, let me take a look. I’ll try to figure out what’s going on.

Edit: @SaviorD7 could you tell me the exact command you used like

java -cp somejar.jar your.main.class

I need that to know how to run your program. One big issue you’re likely running into is memory constraints. Sometimes Java sets default heap sizes that constrain performance.
One thing most people don’t realize is python isn’t afraid to infinitely expand the memory usage of a program till the computer dies.
Java’s GC constrains itself and that causes performance regressions rather than killing a machine.

Honestly though, this is just a guess. Without me being able to run things for myself I can’t really help you.

You also said “put tensorflow aside for a moment”, but it’d actually be really helpful to run something side by side. If something is slower, we should know about it and at least have it be a known issue.
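
To check the heap guess above rather than argue about it, a small sketch like this prints what the JVM thinks it has available on each machine. It uses only the standard `Runtime` API; the class name `JvmResources` is made up for illustration and nothing in it is DL4J-specific.

```java
import java.util.Locale;

// Hypothetical helper (not part of DL4J): reports what the JVM sees on this machine.
final class JvmResources {

    // Converts a byte count to whole mebibytes.
    static long bytesToMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    // Builds a one-line summary of logical CPUs and heap limits.
    static String report() {
        Runtime rt = Runtime.getRuntime();
        return String.format(
                Locale.ROOT,
                "cores=%d maxHeapMiB=%d currentHeapMiB=%d",
                rt.availableProcessors(),      // logical CPUs visible to the JVM
                bytesToMiB(rt.maxMemory()),    // upper bound the heap may grow to
                bytesToMiB(rt.totalMemory())); // heap currently reserved
    }

    public static void main(String[] args) {
        System.out.println(report());
    }
}
```

Run it on both machines with the same JVM flags as the training run. If the maximum heap turns out to be small, `-Xmx` (for example `-Xmx4g`) is the standard JVM flag for raising it. Whether that is the actual bottleneck here still has to be measured.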

My shortened command is:

About TF: I built the same NN in TF and DL4J. You can see the DL4J results above.
The TF version also trained about 4 times faster (10-15, at most 20 minutes).

Right, but do you have a specific example for me? I’d still prefer to run it myself for comparison.

@SaviorD7 so playing with your script a bit…is there any reason your batch size was so small? I still highly doubt you were running something equivalent. I changed the batch size to 1000 and it finished pretty quick. You have more than enough data to increase the batch size. There’s zero reason to have it at 25.
I’d really like to see your TF script. Also, please use the performance listener.
Here’s my modified version that does that: https://gist.github.com/agibsonccc/d232723ef36d8cb425c58c8750dec50e
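
To make the batch-size point concrete, here is a rough sketch of how many parameter updates one epoch takes at each batch size. The 100,000-example dataset size is an assumption for illustration, not the size of the dataset in this thread, and `BatchMath` is a hypothetical helper.

```java
// Hypothetical helper: counts minibatch iterations needed to cover a dataset once.
final class BatchMath {

    // Ceiling division: the last batch may be smaller than batchSize.
    static long iterationsPerEpoch(long examples, int batchSize) {
        if (examples < 0 || batchSize <= 0) {
            throw new IllegalArgumentException("examples must be >= 0 and batchSize > 0");
        }
        return (examples + batchSize - 1) / batchSize;
    }

    public static void main(String[] args) {
        long examples = 100_000L; // assumed dataset size, for illustration only

        System.out.println(iterationsPerEpoch(examples, 25));   // prints 4000
        System.out.println(iterationsPerEpoch(examples, 1000)); // prints 100
    }
}
```

Each iteration carries fixed overhead, so 40 times fewer iterations per epoch usually means a much shorter epoch. A larger batch also means fewer weight updates per epoch, so accuracy should be rechecked after the change and the learning rate may need adjusting.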