Multiple NN copies running in Scala Futures - Benchmarking

Using: beta-7 version.

I’m running a program in which copies of the same NN perform inference inside Scala Futures. The copies are created using model.clone(), and each Future gets its own model copy. I’m comparing this to sequential inference on a single NN. I’m finding that each individual NN call in the parallelized version is about 10x slower than a call in the sequential version. Overall the parallel run finishes sooner, but each individual NN execution is slower.

The model is a CNN and I’m only measuring the time for model.output(input) where the input is an INDArray which has already been created.

I’m trying to figure out if it has something to do specifically with dl4j memory management or more to do with the fact that everything is parallelized.

Here are some of the results. As can be seen, the average per call is on the order of 0.01s for the parallel version and 0.001s for the sequential one, so the sequential calls are roughly 10x faster. I am warming up the NNs in both cases by running the same test 100 times before doing the measurement. If I reduce the number of parallel threads, the speed of the individual NN call seems to go down.

========= Parallel ===============
NN feed forward: 0.009698113
NN feed forward: 0.010438089
NN feed forward: 0.010443257
NN feed forward: 0.01072912
NN feed forward: 0.01133849
NN feed forward: 0.011053186
NN feed forward: 0.012083145
NN feed forward: 0.012296932
NN feed forward: 0.012797978
NN feed forward: 0.013558766
NN feed forward: 0.013470836
NN feed forward: 0.013198661
NN feed forward: 0.01342439
NN feed forward: 0.012691608
NN feed forward: 0.014100041
NN feed forward: 0.013980262
NN feed forward: 0.014412235
Final time: 0.017440721

======= Sequential =====================
NN feed forward: 0.001774196
NN feed forward: 0.001696182
NN feed forward: 0.001617609
NN feed forward: 0.001589184
NN feed forward: 0.001603432
NN feed forward: 0.001599171
NN feed forward: 0.001599171
NN feed forward: 0.001581222
NN feed forward: 0.001615235
NN feed forward: 0.001609089
NN feed forward: 0.001600568
NN feed forward: 0.001610974
NN feed forward: 0.0016086
NN feed forward: 0.001589673
NN feed forward: 0.001581153
NN feed forward: 0.001580733
NN feed forward: 0.001605667
Final time: 0.032481752
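
For readers who want to reproduce this kind of comparison, here is a minimal sketch of how per-call times and the overall "Final time" could be measured separately. It is not the original poster's code: `infer` is a hypothetical CPU-bound stand-in for model.output(input), and the object and method names are made up for illustration.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object TimingSketch {
  // Hypothetical stand-in for model.output(input); replace with the real call.
  def infer(x: Int): Long = {
    var acc = 0L
    var i = 0
    while (i < 200000) { acc += (i ^ x) & 1; i += 1 }
    acc
  }

  // Evaluates `body` once and returns (result, elapsed seconds).
  def timed[A](body: => A): (A, Double) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1e9)
  }

  // Per-call times plus total wall-clock time for n calls on one thread.
  def sequential(n: Int): (Seq[Double], Double) =
    timed((0 until n).map(i => timed(infer(i))._2))

  // Same measurement, but each call runs inside its own Future.
  def parallel(n: Int)(implicit ec: ExecutionContext): (Seq[Double], Double) =
    timed {
      val calls = (0 until n).map(i => Future(timed(infer(i))._2))
      Await.result(Future.sequence(calls), 1.minute)
    }
}
```

The per-call numbers in the parallel case are taken inside each Future, so they include any contention between threads, while the total is the wall-clock time for the whole run. That is why the per-call times can get worse even as the total improves.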

If needed I can provide the code for reproducing.

When models are executed there is also some parallelism in those calculations. You will most likely want to disable that too. Set the OMP_NUM_THREADS=1 environment variable and see if it is any faster for you.

Also: When executing sequentially, put all of your requests into one single batch, and time that as well. It will probably be even faster overall.
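
A rough sketch of the single-batch idea, under some assumptions the thread does not confirm: the model is taken to be a MultiLayerNetwork, and each input is assumed to have a leading batch dimension of 1 (for example [1, channels, height, width]). The object and method names are illustrative only.

```scala
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork
import org.nd4j.linalg.api.ndarray.INDArray
import org.nd4j.linalg.factory.Nd4j

object BatchedInference {
  // Joins single-example inputs along dimension 0 into one minibatch.
  // Assumes every input has shape [1, ...] with identical trailing dimensions.
  def toBatch(inputs: Seq[INDArray]): INDArray =
    Nd4j.concat(0, inputs: _*)

  // One forward pass for all requests; row i of the result belongs to inputs(i).
  // MultiLayerNetwork is an assumption here, the thread only says "CNN".
  def outputBatched(model: MultiLayerNetwork, inputs: Seq[INDArray]): INDArray =
    model.output(toBatch(inputs))
}
```

When timing this, include the concatenation step as well, since building the batch is part of the cost of this approach.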

Okay will stop the parallelism in the models with that flag.

Oh that is a great idea. Will test that out as well.

Yes, the batch method is indeed the fastest. Since the bottleneck is the NN.output method, calling it once is much faster, even with the cost of creating the input INDArrays and vstacking them.

When I run with OMP_NUM_THREADS=1 I end up with the following error:

A fatal error has been detected by the Java Runtime Environment:

SIGSEGV (0xb) at pc=0x00007f8fe7396470, pid=606931, tid=0x00007f8fc95f5700

JRE version: Java™ SE Runtime Environment (8.0_211-b12) (build 1.8.0_211-b12)

Java VM: Java HotSpot™ 64-Bit Server VM (25.211-b12 mixed mode linux-amd64 compressed oops)

Problematic frame:

C [libnd4jcpu.so+0x56d2470] samediff::ticket::acquiredThreads(unsigned int)+0x0

Failed to write core dump. Core dumps have been disabled. To enable core dumping, try “ulimit -c unlimited” before starting Java again

@sachag678 could you give us the hs_err_pid*.log file that was output?
That generally arises from multi threading + OMP_NUM_THREADS > 1

I was recreating the example for testing purposes using generic code and I can’t seem to reproduce the issue without some of the proprietary code. I have attached the .log file but have removed the mentions of the proprietary code. Hope that is okay.

The code was multithreading using Futures and I set OMP_NUM_THREADS=1. One thing to note was I am explicitly creating an execution context in the code in the following manner.

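import java.util.concurrent.Executors
import scala.concurrent.ExecutionContext

implicit val ec: ExecutionContext = new ExecutionContext {
  // Fixed pool of 12 worker threads shared by all Futures
  val threadPool = Executors.newFixedThreadPool(12)

  def execute(runnable: Runnable): Unit = {
    threadPool.submit(runnable)
  }

  // Failures reported to the context are silently ignored
  def reportFailure(t: Throwable): Unit = {}
}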
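
For comparison, the Scala standard library can build an equivalent context from an executor directly. This is only a sketch of that alternative, not the code that produced the crash; the object name and the toy workload are made up for illustration.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, ExecutionContextExecutorService, Future}
import scala.concurrent.duration._

object FixedPoolContext {
  // Wraps a fixed-size pool. With this factory, failures passed to
  // reportFailure go to the default reporter, which prints the stack trace.
  def create(threads: Int): ExecutionContextExecutorService =
    ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(threads))

  def main(args: Array[String]): Unit = {
    implicit val ec: ExecutionContextExecutorService = create(12)
    try {
      // Toy workload standing in for the inference calls.
      val squares = Future.sequence((1 to 4).map(i => Future(i * i)))
      println(Await.result(squares, 10.seconds).sum) // prints 30
    } finally ec.shutdown() // lets the JVM exit once the work is done
  }
}
```

Returning the ExecutionContextExecutorService type keeps shutdown() available, which the hand-written context above does not expose.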

Also, since this issue is mostly solved by the batching approach above, no worries if this is not looked into further.

One question I had concerns the memory management of dl4j: if there are multiple copies of the same model in multiple threads, how does the workspaces memory model handle that?

Thanks!

@sachag678 don’t roll your own thread pools. Use ParallelWrapper:

or for inference ParallelInference:

Okay, thanks. The documentation mentions that ParallelWrapper will handle load across GPUs, but I only have a single GPU. It probably won’t parallelize then, right?

@sachag678 it’s also used for multithreading. If you want, you can use just 1 worker and still leverage the thread pool built in there.
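
For the inference case, a minimal sketch of what using ParallelInference might look like from Scala. The model type (MultiLayerNetwork), the batch limit of 32, and the object and method names are assumptions for illustration; check the builder options against the DL4J version you are running.

```scala
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork
import org.deeplearning4j.parallelism.ParallelInference
import org.deeplearning4j.parallelism.inference.InferenceMode
import org.nd4j.linalg.api.ndarray.INDArray

object ParallelInferenceSketch {
  // Builds a wrapper that manages its own worker threads and model replicas.
  // BATCHED mode groups concurrent requests into minibatches of up to 32
  // (an illustrative value, not a recommendation).
  def build(model: MultiLayerNetwork, workers: Int): ParallelInference =
    new ParallelInference.Builder(model)
      .inferenceMode(InferenceMode.BATCHED)
      .batchLimit(32)
      .workers(workers)
      .build()

  // Intended to be called from several request threads concurrently.
  def infer(pi: ParallelInference, input: INDArray): INDArray =
    pi.output(input)
}
```

With `workers` set to 1 this still routes all requests through the built-in queue and thread handling, so there is no need to clone the model or manage a pool by hand.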
