Unexpected Slow Performance

Hello,
I am seeing slower performance than I believe I should. I have an 8-core Ryzen 3700X and 16 GB of DDR4 RAM.
I expected faster calculations, but the Java process doesn't go above roughly 10% CPU and uses only about 500 MB of RAM.
I can't make many modifications to the code, but as far as I know the original MATLAB version runs faster on a slower computer than mine (I haven't tested it myself). Is there any configuration I can change to increase performance?
Will increasing the number of threads it uses help? How can I do that? (I tried something like Nd4jEnvironment.getEnvironment()… = something, but no luck.)

I'm attaching some log lines that may be useful. I'm running on 1.0.0-alpha.

2021-10-11 22:54:04.718  INFO 13464 --- [0.1-8080-exec-1] org.nd4j.nativeblas.NativeOpsHolder      : Number of threads used for NativeOps: 8
2021-10-11 22:54:04.873  INFO 13464 --- [0.1-8080-exec-1] org.nd4j.nativeblas.Nd4jBlas             : Number of threads used for BLAS: 8
2021-10-11 22:54:04.874  INFO 13464 --- [0.1-8080-exec-1] o.n.l.a.o.e.DefaultOpExecutioner         : Backend used: [CPU]; OS: [Windows 10]
2021-10-11 22:54:04.874  INFO 13464 --- [0.1-8080-exec-1] o.n.l.a.o.e.DefaultOpExecutioner         : Cores: [16]; Memory: [4.0GB];
2021-10-11 22:54:04.874  INFO 13464 --- [0.1-8080-exec-1] o.n.l.a.o.e.DefaultOpExecutioner         : Blas vendor: [OPENBLAS]

@officiallor could you please upgrade to 1.0.0-M1.1? The alpha is at least 18 months old. Please come back after you've done that. Thanks!

@agibsonccc I will try that, but I have some helpers based on this version and upgrading may cause problems. Anything else that I can do?

@officiallor just try the upgrade first; I'm not going to chase down performance problems on an old version. After that we can look more in depth. To contribute anything useful to the discussion, I'll need a lot of specifics from you, which you yourself will have to give me, including a reproduction of the problem with code I can run on my own computer.

I will try to upgrade in the next few days. Is it possible to use more threads on this version?
Is it normal though, even on this version, that something runs in, say, 5 minutes in MATLAB but takes much more time on a good PC?

@officiallor yes, you can use OMP_NUM_THREADS for the math threads.
If you want to run nd4j in a multi-threaded application, could you elaborate on your use case a bit? It's very possible, but I would have more specific recommendations for you if that's what you're looking for.

Edit: Could you also elaborate on what's blocking you from upgrading? Are you running in a special environment, or did you perhaps just follow a tutorial and you're unfamiliar with Maven? When I say "upgrade", what I imagine (as a Java developer myself) is "change the version in the pom.xml in IntelliJ", which shouldn't be hard, but I also understand it might be a bit annoying for someone who's new to these things.

Edit 2: I'd also like to know how you ended up with such an old version of dl4j. We don't usually specify old versions in our docs and tutorials, and all of our examples are updated with each release.
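For example, if you're pulling in the nd4j-native-platform artifact (an assumption on my part; adjust to whichever nd4j artifacts your pom actually lists), the change is just the version number:

<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-native-platform</artifactId>
    <version>1.0.0-M1.1</version>
</dependency>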

How can I set OMP_NUM_THREADS though? I was looking into it yesterday and couldn't manage it. I read the docs, etc., and tried some things, but it didn't work.
My use case is that I'm translating (almost finished) MATLAB code to Kotlin/ND4J; it is, more or less, iterations, multiplications, additions, etc. on relatively small INDArrays. Most of them are at most 2D, and two of them are 3D.
I have some code that is written for the old version of nd4j, so I will have to change/fix some things. I know I will have to change the nd4j version in the pom.xml, and I will.

@officiallor it's just an environment variable. We use OpenBLAS underneath.
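For example, on Windows you can set it in the terminal before launching the app (yourapp.jar below is just a placeholder for however you start the process), or put it under Run Configuration > Environment variables in IntelliJ:

set OMP_NUM_THREADS=16
java -jar yourapp.jar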

Thank you!

I was trying to set it from code before; there was a property that had a setter.
I'm now trying with OMP_NUM_THREADS=16 and it looks somewhat faster now that it's running (execution hasn't finished yet). It is using more RAM though, which I take as a good sign: before it was using up to ~700 MB and now it's ~1.3 GB, which is expected, I believe.
I will try to update the nd4j version this week and hope there aren't many incompatibilities to fix, so I can retry the "speed test" sooner.

2021-10-12 22:52:07.667  INFO 2616 --- [0.1-8080-exec-1] org.nd4j.linalg.factory.Nd4jBackend      : Loaded [CpuBackend] backend
2021-10-12 22:52:07.937  INFO 2616 --- [0.1-8080-exec-1] org.nd4j.nativeblas.NativeOpsHolder      : Number of threads used for NativeOps: 16
2021-10-12 22:52:08.109  INFO 2616 --- [0.1-8080-exec-1] org.nd4j.nativeblas.Nd4jBlas             : Number of threads used for BLAS: 16
2021-10-12 22:52:08.111  INFO 2616 --- [0.1-8080-exec-1] o.n.l.a.o.e.DefaultOpExecutioner         : Backend used: [CPU]; OS: [Windows 10]
2021-10-12 22:52:08.111  INFO 2616 --- [0.1-8080-exec-1] o.n.l.a.o.e.DefaultOpExecutioner         : Cores: [16]; Memory: [4.0GB];
2021-10-12 22:52:08.111  INFO 2616 --- [0.1-8080-exec-1] o.n.l.a.o.e.DefaultOpExecutioner         : Blas vendor: [OPENBLAS]

@officiallor great to hear! Thanks for trying. The other bottlenecks will be around things like GC frequency. There are tools for that, but first I'd like to make sure you are comfortable using everything, and then we can start slowly tuning your workload.
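For reference, the kind of knob I mean for GC frequency is ND4J's periodic GC, which can be adjusted from code; the values below are purely illustrative:

import org.nd4j.linalg.factory.Nd4j

// ask ND4J to trigger its periodic System.gc() at most once every 10 seconds
Nd4j.getMemoryManager().setAutoGcWindow(10000)
// or switch the periodic GC off entirely while measuring
// Nd4j.getMemoryManager().togglePeriodicGc(false)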

@agibsonccc I fixed all the compilation errors, but at runtime, when I try to do something like the following, I get "Can't transpose array with rank < 2: array shape [4]" and I can't go further to check whether the other "fixes" worked. Most of the issues were caused by changing Int to Long for get(), interval(), etc.

With the old version, I could do this. I just want to make the array below a column. Am I missing something?
I would like to avoid iterating through the Java list and assigning the values to a newly created array (Nd4j.create(4,1)).
The code below is just to demonstrate the problem.

val test = Nd4j.create(doubleArrayOf(1.0,2.0,3.0,4.0)).transposei()

The console output:

04:33:33.210 [main] INFO org.nd4j.linalg.factory.Nd4jBackend - Loaded [CpuBackend] backend
04:33:34.091 [main] INFO org.nd4j.nativeblas.NativeOpsHolder - Number of threads used for linear algebra: 16
04:33:34.092 [main] INFO org.nd4j.linalg.cpu.nativecpu.CpuNDArrayFactory - Binary level Generic x86 optimization level AVX/AVX2
04:33:34.145 [main] INFO org.nd4j.nativeblas.Nd4jBlas - Number of threads used for OpenMP BLAS: 16
04:33:34.171 [main] INFO org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner - Backend used: [CPU]; OS: [Windows 10]
04:33:34.171 [main] INFO org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner - Cores: [16]; Memory: [4.0GB];
04:33:34.171 [main] INFO org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner - Blas vendor: [OPENBLAS]
04:33:34.174 [main] INFO org.nd4j.linalg.cpu.nativecpu.CpuBackend - Backend build information:
 GCC: "10.3.0"
STD version: 201103L
DEFAULT_ENGINE: samediff::ENGINE_CPU
HAVE_FLATBUFFERS
HAVE_OPENBLAS
Exception in thread "main" java.lang.IllegalStateException: Can't transpose array with rank < 2: array shape [4]
	at org.nd4j.common.base.Preconditions.throwStateEx(Preconditions.java:638)
	at org.nd4j.common.base.Preconditions.checkState(Preconditions.java:301)
	at org.nd4j.linalg.api.ndarray.BaseNDArray.transposei(BaseNDArray.java:3714)


EDIT: Tried reshape() to fix this → fixed.
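A minimal sketch of what I mean, assuming a 4x1 column is the goal:

val test = Nd4j.create(doubleArrayOf(1.0, 2.0, 3.0, 4.0)).reshape(4, 1)  // 4x1 column instead of transposing a rank-1 array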
EDIT2: java.lang.OutOfMemoryError: Physical memory usage is too high: physicalBytes (8340M) > maxPhysicalBytes (8168M)

@officiallor Could you clarify the problem in your second edit? I have nothing to go on for your memory issue. For anything like this, I'd ideally want to copy and paste something, run it in my IDE, and see the problem myself.

I ran the code after finishing the fixes for some indices that I had changed from Int to Long, and RAM and CPU usage started climbing, with memory reaching about 11 GB. Here is the specific line:

            in.getOf((n - 1), 0).assign(
                saved.getAllColumnsOfRow((n - 1)).mul(timestep / 60f)
                    .mul(retail.mul((1 + e_infr).pow(n.toFloat() - 1)))
                    .add( // <-- debugger says the exception is thrown here
                        sold.getAllColumnsOfRow(n - 1).mul(timestep / 60f)
                            .mul(exc_enrg_pr * (1 + e_infr).pow(n.toFloat() - 1))
                    ).sumNumber().toFloat()
            )


// Here is the initialization of sold and saved; the 1st arg is parametric.
// It is 20 as I checked, so there is no mismatch in the sizes.
val sold = zeroes(20L, 35040L)
val saved = zeroes(20L, 35040L)

// Additional info:
// e_infr = 2/100f
// n runs in a for loop over 1..20 (20 inclusive)
// timestep = 15
// exc_enrg_pr = 0

Here is the stacktrace


java.lang.OutOfMemoryError: Physical memory usage is too high: physicalBytes (9690M) > maxPhysicalBytes (8168M)
	at org.bytedeco.javacpp.Pointer.deallocator(Pointer.java:700) ~[javacpp-1.5.5.jar:1.5.5]
	at org.bytedeco.javacpp.Pointer.init(Pointer.java:126) ~[javacpp-1.5.5.jar:1.5.5]
	at org.bytedeco.javacpp.LongPointer.allocateArray(Native Method) ~[javacpp-1.5.5.jar:1.5.5]
	at org.bytedeco.javacpp.LongPointer.<init>(LongPointer.java:80) ~[javacpp-1.5.5.jar:1.5.5]
	at org.bytedeco.javacpp.LongPointer.<init>(LongPointer.java:53) ~[javacpp-1.5.5.jar:1.5.5]
	at org.nd4j.linalg.cpu.nativecpu.ops.NativeOpExecutioner.createShapeInfo(NativeOpExecutioner.java:2016) ~[nd4j-native-1.0.0-M1.1.jar:na]
	at org.nd4j.linalg.api.shape.Shape.createShapeInformation(Shape.java:3247) ~[nd4j-api-1.0.0-M1.1.jar:1.0.0-M1.1]
	at org.nd4j.linalg.api.ndarray.BaseShapeInfoProvider.createShapeInformation(BaseShapeInfoProvider.java:68) ~[nd4j-api-1.0.0-M1.1.jar:1.0.0-M1.1]
	at org.nd4j.linalg.api.ndarray.BaseNDArray.<init>(BaseNDArray.java:213) ~[nd4j-api-1.0.0-M1.1.jar:1.0.0-M1.1]
	at org.nd4j.linalg.api.ndarray.BaseNDArray.<init>(BaseNDArray.java:324) ~[nd4j-api-1.0.0-M1.1.jar:1.0.0-M1.1]
	at org.nd4j.linalg.cpu.nativecpu.NDArray.<init>(NDArray.java:191) ~[nd4j-native-1.0.0-M1.1.jar:na]
	at org.nd4j.linalg.cpu.nativecpu.CpuNDArrayFactory.createUninitialized(CpuNDArrayFactory.java:226) ~[nd4j-native-1.0.0-M1.1.jar:na]
	at org.nd4j.linalg.factory.Nd4j.createUninitialized(Nd4j.java:4364) ~[nd4j-api-1.0.0-M1.1.jar:1.0.0-M1.1]
	at org.nd4j.linalg.api.ndarray.BaseNDArray.add(BaseNDArray.java:3089) ~[nd4j-api-1.0.0-M1.1.jar:1.0.0-M1.1]

EDIT: Added additional info for the variables that are being used in the specific code

@officiallor One thing that sticks out immediately is that you are calling mul/add. Each of those creates a new array. It would be much better to call the in-place versions like muli/addi/powi.

Could you try eliminating that as a first step and let me know what that does?
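To illustrate the difference with a throwaway example (not your actual variables):

val x = Nd4j.ones(3, 3)
val y = x.mul(2.0)  // allocates a brand-new 3x3 array for the result; x is untouched
x.muli(2.0)         // writes the result back into x's existing buffer, no new allocation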

Thank you,
Yes, I will try it. I'm trying to get the old version running again because I messed something up there, and then I will go back to the new one.
In the old version (I don't know if it's worth anything) I did this constantly: I used addi/muli on some simple occasions and mostly used mul, add, etc. for everything else.
They give the same results, right? I used the copy versions because they looked safer, to be honest.
If it gives the same results and is more efficient, I will of course change it.

@officiallor the way to think about it is something like:

temp = x + 1
temp += 2
temp += 3
temp -= 4

You can do something like this, where you create one temp array and then keep using that intermediate result. The alternative is creating a new array every time as well as copying the results over.
In-place operations mean you allocate less and execution is faster.

Since your problem is memory, fewer allocations will always help. There are ways around this, like increasing the GC frequency, but it's not really something you want if you can avoid it.
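In ND4J terms that pseudocode would look roughly like this, with x being any INDArray:

val temp = x.add(1)  // the only copy: the result buffer is allocated once here
temp.addi(2)         // everything after this reuses temp's buffer in place
temp.addi(3)
temp.subi(4)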

After checking that the old version is good, I will do it. I was a bit afraid that it could cause unexpected results. So wherever there is mul, add, etc. in this line, I should replace it with muli, addi?

The result:

            in.getOf((n - 1), 0).assign(
                saved.getAllColumnsOfRow((n - 1)).muli(timestep / 60f)
                    .muli(retail.muli((1 + e_infr).pow(n.toFloat() - 1)))
                    .addi( // <-- debugger says the exception is thrown here
                        sold.getAllColumnsOfRow(n - 1).muli(timestep / 60f)
                            .muli(exc_enrg_pr * (1 + e_infr).pow(n.toFloat() - 1))
                    ).sumNumber().toFloat()
            )

@officiallor Just make sure the first result is a copy operation, with the rest being in place.
It might not be as concise, but for performance reasons you could also break things up into separate lines or even private helper functions.

@agibsonccc I don't think I understand :sweat_smile:
Do you mean changing the first muli to mul? That makes sense.

@officiallor Sorry, I was trying to generalize it a bit for you.
Yes, you have the right idea.
When you want to increase performance and reduce allocations, you make sure that your result array gets created only once.
That happens by doing what I described: create the temp buffer first with mul, then make everything else in place.
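Applied to your earlier line, it would look roughly like this; I'm keeping your getOf/getAllColumnsOfRow helpers and variable names as given, so treat it as a sketch rather than drop-in code:

// each chain copies exactly once (mul), so saved/sold/retail are never modified;
// everything after that reuses the temporary in place
val savedPart = saved.getAllColumnsOfRow(n - 1).mul(timestep / 60f)
savedPart.muli(retail.mul((1 + e_infr).pow(n.toFloat() - 1)))

val soldPart = sold.getAllColumnsOfRow(n - 1).mul(timestep / 60f)
soldPart.muli(exc_enrg_pr * (1 + e_infr).pow(n.toFloat() - 1))

in.getOf(n - 1, 0).assign(savedPart.addi(soldPart).sumNumber().toFloat())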