I am a beginner with the DL4J project. Recently, I have been profiling DL4J using the dl4j-examples project, and I want to figure out the following two problems:
First, I trained the network from org.deeplearning4j.examples.quickstart.modeling.convolution.LeNetMNIST in both PyTorch and DL4J (keeping the same network layers), and profiled both processes with /usr/bin/time. I found that DL4J takes considerably longer than PyTorch. Moreover, the time spent in the system kernel (sys) is even longer than the user time (user). What could cause this? Perhaps a large number of syscalls?
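For reference, here is roughly how I collected the numbers, plus a small stand-in experiment showing how a syscall-heavy process ends up with sys above user. The jar name and classpath are placeholders for my local build:

```shell
# Sketch of the measurement (jar name is a placeholder for my local build):
#   /usr/bin/time -v java -cp dl4j-examples.jar \
#       org.deeplearning4j.examples.quickstart.modeling.convolution.LeNetMNIST
#
# A sys > user split usually means many syscalls (I/O, mmap/munmap, futex, page
# faults); `strace -cf <cmd>` prints a per-syscall summary to confirm which
# ones dominate.
#
# Stand-in demonstration: dd with a tiny block size issues one read and one
# write syscall per 512-byte block, so kernel time dominates user time.
{ time dd if=/dev/zero of=/dev/null bs=512 count=200000 2>/dev/null ; } 2>&1
```

If the DL4J run shows the same pattern, `strace -cf` on the Java process should reveal which syscalls dominate (futex-heavy output would point at thread contention rather than I/O).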
Secondly, I profiled LeNetMNIST.java with JFR (Java Flight Recorder), and the top of the flame graph shows that the CPU is mainly occupied by execCustomOp2. I want to measure the overhead of invoking the corresponding native methods. How can I do this?
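For context, this is roughly how I produced the recording (a sketch only: the JVM flags assume JDK 11+ syntax, and the jar name and output file names are just my local choices):

```shell
# Record with JFR while training (JDK 11+ flag syntax; file names are arbitrary):
java -XX:StartFlightRecording=duration=120s,filename=lenet.jfr \
     -cp dl4j-examples.jar \
     org.deeplearning4j.examples.quickstart.modeling.convolution.LeNetMNIST

# Inspect the sampled hot methods from the recording:
jfr print --events jdk.ExecutionSample lenet.jfr | head
```

Note that JFR's execution samples stop at the JNI boundary, which is why execCustomOp2 sits at the top of my flame graph without showing what the native code does underneath.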
@yaoyao did you run the example multiple times? On startup we have to download the dataset and then we cache it locally.
Also, just for reproducibility, which versions of PyTorch and DL4J are you using? Was it the master branch, like in your GitHub issue? I will try to figure out the difference there. Could you repost your original flame graph?
There might be some quick wins here.
If I had to guess, it might be due to allocation/GC. There have been quite a few performance improvements on the current master branch that I’d love to test here, since you’re already compiling from source.
Do you mind helping with that? I’d really appreciate it. Thanks!
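In the meantime, one quick way to check the allocation/GC hypothesis on your side. These are standard JDK flags, nothing DL4J-specific (JDK 11+ unified logging syntax; the jar and file names are placeholders for your setup):

```shell
# Log all GC activity to a file while training:
java -Xlog:gc*:file=gc.log \
     -cp dl4j-examples.jar \
     org.deeplearning4j.examples.quickstart.modeling.convolution.LeNetMNIST

# Frequent or long pauses in gc.log point to allocation pressure; your existing
# JFR recording can also show per-site allocations via its allocation events.
```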