Unexpected Slow Performance

With the following code and 16 threads, it takes 3 minutes to crash with an OutOfMemoryError. With 8 threads it crashes faster. Memory usage spikes up to 12 GB and then drops. CPU usage is at 70% or more.
I couldn’t use muli on the second line because I was getting another exception:
java.lang.IllegalStateException: Cannot perform in-place operation "muli": result array shape does not match the broadcast operation output shape: [35040].muli([35040, 1]) != [35040].

            in.getOf((n-1),0).assign(saved.getAllColumnsOfRow((n-1)).mul(timestep / 60f)
                .mul(retail.muli((1 + e_infr).pow(n.toInt() - 1)))
                .addi(sold.getAllColumnsOfRow(n-1).muli(timestep / 60f)
                .muli(exc_enrg_pr * (1 + e_infr).pow(n.toInt() - 1))).sumFloat())

@officiallor Hmm…that looks a bit off. If you’re only using vectors, you can easily reshape there to ensure it’s broadcastable. It might be worth doing that for your use case. If you don’t know the length of the output, you can just do .reshape(length) or something similar.
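To illustrate why the two shapes in the exception do not line up, here is a minimal plain-Java sketch of NumPy-style broadcasting. It assumes ND4J applies the same right-aligned rule in this case, which the thread does not confirm, and the class and method names (BroadcastSketch, broadcastShape, canApplyInPlace, elementCount) are hypothetical helpers, not ND4J API. Under that assumption, [35040] combined with [35040, 1] broadcasts to [35040, 35040], which cannot be written back into a [35040] array, and a non-in-place mul would have to allocate the full matrix.

```java
import java.util.Arrays;

class BroadcastSketch {
    // Assumed NumPy-style rule: right-align the shapes, pad the shorter one
    // with 1s, and require each pair of dimensions to be equal or contain a 1.
    static long[] broadcastShape(long[] a, long[] b) {
        int rank = Math.max(a.length, b.length);
        long[] out = new long[rank];
        for (int i = 0; i < rank; i++) {
            int ai = a.length - rank + i;
            int bi = b.length - rank + i;
            long da = ai >= 0 ? a[ai] : 1;
            long db = bi >= 0 ? b[bi] : 1;
            if (da != db && da != 1 && db != 1) {
                throw new IllegalArgumentException("Shapes are not broadcastable: "
                        + Arrays.toString(a) + " vs " + Arrays.toString(b));
            }
            out[i] = (da == 1) ? db : da;
        }
        return out;
    }

    // An in-place operation can only succeed if the broadcast result
    // has exactly the shape of the array being overwritten.
    static boolean canApplyInPlace(long[] target, long[] other) {
        return Arrays.equals(broadcastShape(target, other), target);
    }

    // Number of elements in an array of the given shape.
    static long elementCount(long[] shape) {
        long n = 1;
        for (long d : shape) {
            n *= d;
        }
        return n;
    }

    public static void main(String[] args) {
        long[] row = {35040};        // shape of one row taken from saved/sold
        long[] column = {35040, 1};  // shape of retail after transpose()
        long[] result = broadcastShape(row, column);
        System.out.println("broadcast shape: " + Arrays.toString(result));
        System.out.println("in-place possible: " + canApplyInPlace(row, column));
        System.out.println("bytes as float32: " + elementCount(result) * Float.BYTES);
    }
}
```

If that assumption holds, each non-in-place mul of a row against the transposed retail would produce a 35040 x 35040 float result of about 4.9 GB, which might account for part of the memory spike. Reshaping both operands to the same 1-D length keeps the product at 35040 elements.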

Could you give me a bit more than your loop here, or something I could run on my end to dissect this a bit for you? If I can run a self-contained calculation, I can try to give you something a bit more efficient.

I’m giving the idea in the following code. I don’t know what else I can provide to help.

        val n_max = 20
        val e_infr = 2
        val timestep = 15
        val exc_enrg_pr = 0
        val in = zeroes(n_max.toLong(), 1)
        val out = zeroes(n_max.toLong(), 1)
        val sold = zeroes(n_max.toLong(), 35040)
        val saved = zeroes(n_max.toLong(), 35040)
        var retail = zeroes(1, 35040)
        val numberHere = please enter a random number of your choice

        for (i in 1..35040 step 96) { // this loop only fills retail for the example;
                                      // the original has other operations inside
            retail.get(NDArrayIndex.all(), NDArrayIndex.interval(i - 1, i + 95)).assign(numberHere)
        }

        retail = retail.transpose().mul(1 + numberHere / 100f)

        for (n in 1..n_max.toLong()) {
            in.getOf((n - 1), 0).assign(
                saved.getAllColumnsOfRow((n - 1)).mul(timestep / 60f)
                    .mul(retail.mul((1 + e_infr).pow(n.toFloat() - 1)))
                    .addi(sold.getAllColumnsOfRow(n - 1).mul(timestep / 60f)
                        .mul(exc_enrg_pr * (1 + e_infr).pow(n.toFloat() - 1))).sumFloat())
        }

@officiallor it’ll take me some time to replicate this. When I asked for code, I meant something I could copy/paste and run right away, not pseudocode.
By not giving it to me in exact form, I now have to spend time understanding your interpretation of certain things, as well as reimplementing pseudocode like getOf(…), when all I should be doing is focusing on your performance issue.

I’ll take a look but I want to just give that feedback for future reference. Costing me extra time doesn’t help me help you :slight_smile:

@officiallor here’s the beginning sample I’m looking at and will debug:

    public static void main(String...args) {
        int nMax = 20;
        int eInfr = 2;
        int timeStep = 15;
        int excEnrgPr = 0;
        INDArray in = Nd4j.zeros(nMax,1);
        INDArray out = Nd4j.zeros(nMax,1);
        INDArray sold = Nd4j.zeros(nMax,35040);
        INDArray saved = Nd4j.zeros(nMax,35040);
        INDArray retail = Nd4j.zeros(1,35040);
        int numberHere = 2;
        for(int i = 1; i < 35040; i+= 96) {
            retail.get(NDArrayIndex.all(), NDArrayIndex.interval(i - 1, i + 95)).assign(numberHere);
        }

        retail = retail.transpose().mul(1 + numberHere / 100f);
        for(int i = 1; i < nMax; i++) {
            in.slice((i - 1), 0).assign(
                    saved.slice((i - 1)).mul(timeStep / 60f)
                            .muli(retail.mul(Math.pow(1 + eInfr,i - 1)))
                            .addi(sold.slice(i - 1).muli(timeStep / 60f)
                                            .muli(excEnrgPr * Math.pow(1 + eInfr,i - 1))));
        }
    }

I’ll post this so you can follow along.

Oh, sorry, but I couldn’t help more because it’s not so straightforward.
getOf() accepts vararg Long. It creates NDArrayIndexes and calls get() with those indexes as parameters. Here it can be replaced with get(NDArrayIndex.point(y),NDArrayIndex.point(z)).

@officiallor do you mind posting your helpers just in case?

Edit: Good news is I’m reproducing your memory issues. So this is a start to me looking at it in depth. Thanks for meeting me in the middle a bit.

Can I send them privately? I mean it’s a big file. Or I can send the specific ones that I’m using in this example in a PM.

I noticed that in the for loops you are using <, but it should be <=. I don’t think I made that clear in my last posts. The .. range operator in Kotlin is inclusive.
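As a small sketch of that difference, the plain-Java snippet below counts the iterations of an inclusive loop (what Kotlin’s 1..n_max does) against an exclusive one (i < nMax). The class and method names are hypothetical and only there for illustration. With the values from this thread, the main loop runs 20 times when inclusive but only 19 times with <, while the stepped loop that fills retail happens to run the same number of times either way.

```java
class RangeSketch {
    // Iterations of a Kotlin-style `for (n in first..last step step)`,
    // where the upper bound is included.
    static int inclusiveCount(int first, int last, int step) {
        int count = 0;
        for (int n = first; n <= last; n += step) {
            count++;
        }
        return count;
    }

    // Iterations of a Java-style `for (int i = first; i < last; i += step)`,
    // where the upper bound is excluded.
    static int exclusiveCount(int first, int last, int step) {
        int count = 0;
        for (int i = first; i < last; i += step) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Main loop over n with nMax = 20.
        System.out.println("inclusive 1..20: " + inclusiveCount(1, 20, 1));
        System.out.println("exclusive 1..<20: " + exclusiveCount(1, 20, 1));
        // Stepped loop that fills retail in chunks of 96.
        System.out.println("inclusive 1..35040 step 96: " + inclusiveCount(1, 35040, 96));
        System.out.println("exclusive 1..<35040 step 96: " + exclusiveCount(1, 35040, 96));
    }
}
```

In the Java sample, changing the second loop condition to i <= nMax would match the Kotlin original.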