Unexpected Slow Performance

I am facing slower performance than I believe I should. I have an 8-core Ryzen 3700X and 16GB of DDR4 RAM.
I expected faster calculations, but the Java process never goes above ~10% CPU and uses only about 500MB of RAM.
I can't make many modifications to the code, but as far as I know, the original is MATLAB, and it reportedly runs faster there on a slower computer than mine (I haven't tested it myself). Is there any configuration I can change to increase performance?
Will increasing the number of cores help? How can I do it? (I tried using Nd4jEnvironment.getEnvironment()… = something but no luck)

I’m attaching some lines that may be useful. Running on 1.0.0-alpha.

2021-10-11 22:54:04.718  INFO 13464 --- [0.1-8080-exec-1] org.nd4j.nativeblas.NativeOpsHolder      : Number of threads used for NativeOps: 8
2021-10-11 22:54:04.873  INFO 13464 --- [0.1-8080-exec-1] org.nd4j.nativeblas.Nd4jBlas             : Number of threads used for BLAS: 8
2021-10-11 22:54:04.874  INFO 13464 --- [0.1-8080-exec-1] o.n.l.a.o.e.DefaultOpExecutioner         : Backend used: [CPU]; OS: [Windows 10]
2021-10-11 22:54:04.874  INFO 13464 --- [0.1-8080-exec-1] o.n.l.a.o.e.DefaultOpExecutioner         : Cores: [16]; Memory: [4.0GB];
2021-10-11 22:54:04.874  INFO 13464 --- [0.1-8080-exec-1] o.n.l.a.o.e.DefaultOpExecutioner         : Blas vendor: [OPENBLAS]

@officiallor could you please upgrade to 1.0.0-M1.1? alpha is at least 18 months old. Please come back after you’ve done that. Thanks!

@agibsonccc I will try that, but I have some helpers based on this version and upgrading may cause problems. Anything else that I can do?

@officiallor just try the upgrade first. I'm not going to chase down performance problems on an old version. After that we can look more in depth. To contribute anything useful to the discussion, I'll need a lot of specifics from you, including a reproduction of the problem with code I can run on my own computer.

I will try to upgrade in the next few days. Is it possible to use more threads on this version?
Is it normal, though, even on this version, for something that runs in MATLAB in, say, 5 minutes to take much longer on a good PC?

@officiallor yes, you can use OMP_NUM_THREADS for math threads.
If you want to run nd4j in a multi-threaded application, could you elaborate on your use case a bit? It's very possible, but I would have more specific recommendations for you if that's what you're looking for.

Edit: Could you please also elaborate on what's blocking you from upgrading? Are you running in a special environment, or did you just follow a tutorial and you're unfamiliar with Maven? When I say "upgrade", what I imagine (as a Java developer myself) is "change the version in the pom.xml in IntelliJ", which shouldn't be hard but, I understand, might be a bit annoying for someone who's new to these things.

Edit 2: I’d also like to know how you ended up with a fairly old version of dl4j. We don’t usually specify old versions in our docs and tutorials and all of our examples are updated each release.

How can I set OMP_NUM_THREADS, though? I was looking yesterday and couldn't do it. I read the docs and tried some things, but it didn't work.
My use case: I'm translating MATLAB code (almost finished) to Kotlin/nd4j. It's more or less iterations, multiplications, additions, etc. on relatively small INDArrays; most are at most 2D, and two of them are 3D.
I have some code written for the old version of nd4j, so I will have to change/fix some things. I know I will have to change the nd4j version in pom.xml, and I will.

@officiallor it's just an environment variable. We use OpenBLAS underneath.
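One thing worth noting: OMP_NUM_THREADS has to be present in the environment *before* the JVM starts (IDE run configuration, shell, etc.); it can't be set from running code, which would explain why the setter-based attempts failed. A quick sanity check from Kotlin, just as a sketch (`ompThreadCount` is a made-up helper name):

```kotlin
// OMP_NUM_THREADS must be set *before* the JVM starts (e.g. in the
// IntelliJ run configuration or the shell); System.getenv is read-only,
// so code can only observe it, not change it.
fun ompThreadCount(): Int =
    System.getenv("OMP_NUM_THREADS")?.toIntOrNull()
        // OpenMP's usual default when the variable is unset
        ?: Runtime.getRuntime().availableProcessors()

fun main() {
    println("Math threads OpenBLAS should use: ${ompThreadCount()}")
}
```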

Thank you!

I was trying to set it from code before; there was a property that had a setter.
I'm trying with OMP_NUM_THREADS=16, and it looks somewhat faster now that I'm running it (the execution hasn't finished yet). It is using more RAM, which I take as a good sign: before, it was using up to ~700MB, and now it's ~1.3GB, which I believe is expected.
I will try to update the nd4j version this week and hope there aren't many incompatibilities to fix, so I can retry the "speed test" soon.

2021-10-12 22:52:07.667  INFO 2616 --- [0.1-8080-exec-1] org.nd4j.linalg.factory.Nd4jBackend      : Loaded [CpuBackend] backend
2021-10-12 22:52:07.937  INFO 2616 --- [0.1-8080-exec-1] org.nd4j.nativeblas.NativeOpsHolder      : Number of threads used for NativeOps: 16
2021-10-12 22:52:08.109  INFO 2616 --- [0.1-8080-exec-1] org.nd4j.nativeblas.Nd4jBlas             : Number of threads used for BLAS: 16
2021-10-12 22:52:08.111  INFO 2616 --- [0.1-8080-exec-1] o.n.l.a.o.e.DefaultOpExecutioner         : Backend used: [CPU]; OS: [Windows 10]
2021-10-12 22:52:08.111  INFO 2616 --- [0.1-8080-exec-1] o.n.l.a.o.e.DefaultOpExecutioner         : Cores: [16]; Memory: [4.0GB];
2021-10-12 22:52:08.111  INFO 2616 --- [0.1-8080-exec-1] o.n.l.a.o.e.DefaultOpExecutioner         : Blas vendor: [OPENBLAS]

@officiallor great to hear! Thanks for trying. The other bottlenecks will be around things like GC frequency among other things. There are tools for that, but first I’d like to make sure you are comfortable using everything and then we can start slowly tuning your workload there.

@agibsonccc I fixed all the compilation errors, but at runtime, when I try to do something like the following, I get "Can't transpose array with rank < 2: array shape [4]" and can't go further to see if the other "fixes" worked. Most of the issues were caused by changing Int to Long for get(), interval(), etc.

With the old version, I could do it. I just want to make this (see the code below) a column. Am I missing something?
I would like to avoid iterating through the Java list and assigning the values to a newly created array (Nd4j.create(4,1)).
The code below is just to demonstrate the problem.

val test = Nd4j.create(doubleArrayOf(1.0,2.0,3.0,4.0)).transposei()

The console output:

04:33:33.210 [main] INFO org.nd4j.linalg.factory.Nd4jBackend - Loaded [CpuBackend] backend
04:33:34.091 [main] INFO org.nd4j.nativeblas.NativeOpsHolder - Number of threads used for linear algebra: 16
04:33:34.092 [main] INFO org.nd4j.linalg.cpu.nativecpu.CpuNDArrayFactory - Binary level Generic x86 optimization level AVX/AVX2
04:33:34.145 [main] INFO org.nd4j.nativeblas.Nd4jBlas - Number of threads used for OpenMP BLAS: 16
04:33:34.171 [main] INFO org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner - Backend used: [CPU]; OS: [Windows 10]
04:33:34.171 [main] INFO org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner - Cores: [16]; Memory: [4.0GB];
04:33:34.171 [main] INFO org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner - Blas vendor: [OPENBLAS]
04:33:34.174 [main] INFO org.nd4j.linalg.cpu.nativecpu.CpuBackend - Backend build information:
 GCC: "10.3.0"
STD version: 201103L
Exception in thread "main" java.lang.IllegalStateException: Can't transpose array with rank < 2: array shape [4]
	at org.nd4j.common.base.Preconditions.throwStateEx(Preconditions.java:638)
	at org.nd4j.common.base.Preconditions.checkState(Preconditions.java:301)
	at org.nd4j.linalg.api.ndarray.BaseNDArray.transposei(BaseNDArray.java:3714)

EDIT: Trying with reshape() to fix things —> Fixed this
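For reference, in the newer versions Nd4j.create(doubleArrayOf(…)) yields a rank-1 array, and transpose requires rank ≥ 2, hence the error. A minimal sketch of the reshape fix (assuming nd4j 1.0.0-M1.1 on the classpath):

```kotlin
import org.nd4j.linalg.factory.Nd4j

fun main() {
    // Rank-1 array with shape [4] -- transposei() throws on this.
    val flat = Nd4j.create(doubleArrayOf(1.0, 2.0, 3.0, 4.0))
    // Reshape to a 4x1 column vector instead of transposing.
    val column = flat.reshape(4L, 1L)
    println(column.shape().contentToString()) // [4, 1]
}
```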
EDIT2: java.lang.OutOfMemoryError: Physical memory usage is too high: physicalBytes (8340M) > maxPhysicalBytes (8168M)

@officiallor Could you clarify the problem in your second edit? I have nothing to go on for your memory issue. Anything like this I would ideally copy and paste something, run it in my IDE and see the problem.

I ran the code after finishing the fixes for some indexes that I had changed from Int to Long, and RAM and CPU usage started climbing, up to 11GB. Here is the specific line:

            in.getOf((n - 1), 0).assign(
                saved.getAllColumnsOfRow((n - 1)).mul(timestep / 60f)
                    .mul(retail.mul((1 + e_infr).pow(n.toFloat() - 1)))
                    .add( // Debugger says the exception is thrown here
                        sold.getAllColumnsOfRow(n - 1).mul(timestep / 60f)
                            .mul(exc_enrg_pr * (1 + e_infr).pow(n.toFloat() - 1))

// Here is the initialization of sold and saved; the 1st arg is parametric.
// It is 20 as I checked, so there is no size mismatch.
val sold = zeroes(20L, 35040L)
val saved = zeroes(20L, 35040L)
//Additional info: 
e_infr = 2/100f
n is in a for loop from 1..20 (20 inclusive)
timestep = 15
exc_enrg_pr = 0

Here is the stacktrace

java.lang.OutOfMemoryError: Physical memory usage is too high: physicalBytes (9690M) > maxPhysicalBytes (8168M)
	at org.bytedeco.javacpp.Pointer.deallocator(Pointer.java:700) ~[javacpp-1.5.5.jar:1.5.5]
	at org.bytedeco.javacpp.Pointer.init(Pointer.java:126) ~[javacpp-1.5.5.jar:1.5.5]
	at org.bytedeco.javacpp.LongPointer.allocateArray(Native Method) ~[javacpp-1.5.5.jar:1.5.5]
	at org.bytedeco.javacpp.LongPointer.<init>(LongPointer.java:80) ~[javacpp-1.5.5.jar:1.5.5]
	at org.bytedeco.javacpp.LongPointer.<init>(LongPointer.java:53) ~[javacpp-1.5.5.jar:1.5.5]
	at org.nd4j.linalg.cpu.nativecpu.ops.NativeOpExecutioner.createShapeInfo(NativeOpExecutioner.java:2016) ~[nd4j-native-1.0.0-M1.1.jar:na]
	at org.nd4j.linalg.api.shape.Shape.createShapeInformation(Shape.java:3247) ~[nd4j-api-1.0.0-M1.1.jar:1.0.0-M1.1]
	at org.nd4j.linalg.api.ndarray.BaseShapeInfoProvider.createShapeInformation(BaseShapeInfoProvider.java:68) ~[nd4j-api-1.0.0-M1.1.jar:1.0.0-M1.1]
	at org.nd4j.linalg.api.ndarray.BaseNDArray.<init>(BaseNDArray.java:213) ~[nd4j-api-1.0.0-M1.1.jar:1.0.0-M1.1]
	at org.nd4j.linalg.api.ndarray.BaseNDArray.<init>(BaseNDArray.java:324) ~[nd4j-api-1.0.0-M1.1.jar:1.0.0-M1.1]
	at org.nd4j.linalg.cpu.nativecpu.NDArray.<init>(NDArray.java:191) ~[nd4j-native-1.0.0-M1.1.jar:na]
	at org.nd4j.linalg.cpu.nativecpu.CpuNDArrayFactory.createUninitialized(CpuNDArrayFactory.java:226) ~[nd4j-native-1.0.0-M1.1.jar:na]
	at org.nd4j.linalg.factory.Nd4j.createUninitialized(Nd4j.java:4364) ~[nd4j-api-1.0.0-M1.1.jar:1.0.0-M1.1]
	at org.nd4j.linalg.api.ndarray.BaseNDArray.add(BaseNDArray.java:3089) ~[nd4j-api-1.0.0-M1.1.jar:1.0.0-M1.1]

EDIT: Added additional info for the variables that are being used in the specific code

@officiallor One thing that sticks out immediately: you are calling mul/add. Each of those creates a new array. It would be much better for you to call the in-place versions like muli/addi/powi.

Could you try eliminating that as a first step and let me know what that does?

Thank you,
Yes, I will try it. I'm getting the old version running first, because I messed something up there, and then I will go back to the new one.
In the old version, I don't know if it's worth anything, but I did this constantly: I used addi/muli in some simple cases and mul/add for everything else.
It gives the same results, right? I used the copy versions because they looked safer, to be honest.
If the results are the same and it's more efficient, I will change it, of course.

@officiallor the way to think about it is something like:

temp = x + 1
temp += 2
temp += 3
temp -= 4

You can do something like this, where you create one temp array and then use that intermediate result. The alternative creates a new array every time and copies the results over.
In-place operations mean fewer allocations as well as faster execution.

Since your problem is memory, fewer allocations will always help. There are ways around this, like increasing the GC frequency, but that's not really something you want if you can avoid it.
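The difference can be sketched even without nd4j; here plain DoubleArrays stand in for INDArrays, and mulCopy/muliInPlace are made-up names mirroring mul/muli:

```kotlin
// Copy-style: every call allocates a fresh array (like mul/add in nd4j).
fun DoubleArray.mulCopy(s: Double) = DoubleArray(size) { this[it] * s }

// In-place style: mutates the receiver, no allocation (like muli/addi).
fun DoubleArray.muliInPlace(s: Double): DoubleArray {
    for (i in indices) this[i] *= s
    return this
}

fun main() {
    val x = doubleArrayOf(1.0, 2.0, 3.0)
    // Chain of copy ops: three intermediate arrays allocated.
    val a = x.mulCopy(2.0).mulCopy(3.0).mulCopy(4.0)
    // One copy to protect x, then mutate that single buffer.
    val b = x.mulCopy(2.0).muliInPlace(3.0).muliInPlace(4.0)
    println(a.contentToString()) // [24.0, 48.0, 72.0]
    println(b.contentToString()) // same values, fewer allocations
}
```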

After checking that the old version is good, I will do it. I was somewhat afraid it could cause unexpected results. So wherever there is mul, add, etc., I will have to replace it with muli, addi, like in this line?

The result:

            in.getOf((n - 1), 0).assign(
                saved.getAllColumnsOfRow((n - 1)).muli(timestep / 60f)
                    .muli(retail.muli((1 + e_infr).pow(n.toFloat() - 1)))
                    .addi( // Debugger says the exception is thrown here
                        sold.getAllColumnsOfRow(n - 1).muli(timestep / 60f)
                            .muli(exc_enrg_pr * (1 + e_infr).pow(n.toFloat() - 1))

@officiallor Just make sure the first operation is a copy, with the rest in place.
It might not be as concise, but for performance reasons you could also break things up into separate lines or even private helper functions.

@agibsonccc I think I don't understand :sweat_smile:
Do you mean changing the first muli to mul? That makes sense.

@officiallor Sorry, I was trying to generalize it a bit for you.
Yes, you have the right idea.
When you want to increase performance and reduce allocations, you always ensure that your result array gets created only once.
That happens by doing what I described: create the temp buffer first with mul, then make everything else in place.
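Putting that together, a hypothetical sketch of the pattern with nd4j (array names and sizes are illustrative, not the thread's real code; assuming 1.0.0-M1.1 on the classpath):

```kotlin
import org.nd4j.linalg.factory.Nd4j

fun main() {
    val saved = Nd4j.rand(1, 5)
    val sold = Nd4j.rand(1, 5)
    val scale = 15 / 60.0

    // First op is a copy: mul allocates the result buffer once...
    val result = saved.mul(scale)
        // ...then every later step mutates that same buffer.
        .muli(1.02)
        // The inner chain likewise starts with one copy, so `sold`
        // itself is never mutated.
        .addi(sold.mul(scale))
    println(result.shape().contentToString()) // [1, 5]
}
```

The key point is that only the two mul calls allocate; the muli/addi steps reuse the buffer created by the first one, so `saved` and `sold` stay untouched between loop iterations.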