Fastest/preferred way to get values of multiple indexes of an array

Assuming I have two arrays, one with a set of values and the other with a set of indices, what is the preferred way of getting all values at the given indices? The same question was asked in this stackoverflow post a few years ago, with no truly satisfactory answer regarding performance. Is the solution presented there the best you can do in Nd4j?

@JBraig my answer is the same. What you’re asking for is physically impossible. Views that can’t be calculated from a regular pattern (e.g. anything using SpecifiedIndex) and are just scattered are by definition not going to be fast.
That means a copy.

Fast creation of views only happens when you can slice arrays a certain way.
So it’s not about the answer being satisfactory; it’s about not being able to make an impossibility possible.
Views vary in how they are calculated, but anything using a SpecifiedIndex (as opposed to all, an interval, or something more predictable) will not allow you to create strides/offsets that describe a valid view. You might want to read up on how ndarrays actually work. Here’s a good reference: https://deeplearning4j.konduit.ai/nd4j/reference
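
To make the strides/offsets point concrete, here is a minimal sketch in plain Java (not ND4J internals; the shape and index values are made-up examples). A view only needs an offset plus fixed strides, which covers interval-style slicing but has no equivalent for an arbitrary list of row indices:

public class StrideSketch {
    public static void main(String[] args) {
        // A C-ordered [81000, 3] matrix has element strides {3, 1}:
        // element (i, j) lives at offset + i * 3 + j * 1 in the buffer.
        long offset = 0;
        long[] stride = {3, 1};

        // interval(100, 200) is expressible as a view: shift the offset,
        // shrink the shape, keep the strides. No data is copied.
        long intervalViewOffset = offset + 100 * stride[0];

        // An arbitrary index list has no single offset/stride pair that
        // generates its addresses, so the rows must be gathered (copied).
        long[] inds = {7, 4380, 12};
        long[] rowAddresses = new long[inds.length];
        for (int i = 0; i < inds.length; i++) {
            rowAddresses[i] = offset + inds[i] * stride[0];
        }

        System.out.println("interval view offset: " + intervalViewOffset);
        System.out.println("scattered rows need " + rowAddresses.length + " separate gathers");
    }
}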

Alright, then it seems like the solutions over there are the best I’ll get. But now I’m curious: do you know how numpy manages to be so much faster for the same kind of operation? For reference, I currently have a value array of shape [81000, 3] and an index array with 4384 entries, each corresponding to a different row of the value array.

My fastest Nd4j-implementation looks like this:

INDArray vals = ...  // All my values
long[] inds = ...    // All my indexes

INDArray out = Nd4j.zeros(inds.length, 3);
for (int i = 0; i < inds.length; i++) {
    out.putRow(i, vals.get(NDArrayIndex.point(inds[i])));
}

This (at least for me) is ever so slightly faster than INDArray out = vals.get(NDArrayIndex.indices(inds));, but still takes about 250 ms on my machine. OTOH, my numpy code looks like this

out = vals[inds, :]

with vals, inds and out all being numpy ndarrays. This implementation for the same values as above takes less than 1 ms. Do you know why it’s so much faster? Do they have a different general approach?

@JBraig it depends. If there’s something specific we can look into, it could be a specific combination of indices that may not need a copy. Generally numpy uses something called an NDIndexIterator (we do as well) for placing elements that are scattered all over the place.
I’m not against trying to make something faster. Let me take a look at what numpy might do there.

What I’m seeing here is actually not equivalent, though. On the numpy side you’re just calling a get operation.
With nd4j you’re manually creating everything, including a hand-written for loop.
That includes a bunch of allocations. The way you’re doing it, you’re actually doing two allocations per row: the on-heap information in Java and the off-heap buffer.

For the matrix at least you can actually use:
arr.getRows(…);

This will use the native (C++) code path for the operation, which also avoids the per-row allocations. Numpy is similar.
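
A minimal usage sketch of that suggestion (the random matrix and hard-coded indices below are placeholders, and it assumes getRows takes int indices, so a long[] index array would need converting):

import java.util.Arrays;

import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

public class GetRowsSketch {
    public static void main(String[] args) {
        INDArray vals = Nd4j.rand(81000, 3);   // stand-in for the real value matrix
        int[] inds = {7, 4380, 12, 80999};     // stand-in for the real index list

        // One call into the native backend instead of a Java-side loop.
        // This still copies the selected rows, but skips the per-row
        // INDArray allocations of the manual putRow loop.
        INDArray out = vals.getRows(inds);

        System.out.println(Arrays.toString(out.shape())); // [4, 3]
    }
}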

Thanks for looking into it.

For the matrix at least you can actually use:
arr.getRows(…);

I tried that as well:

INDArray out = vals.getRows(inds);

and it’s marginally slower than the for loop above (but faster than using vals.get(NDArrayIndex.indices(inds))).

@JBraig how many rows? Sometimes that’s due to off-heap memory overhead as well. You can run into that with toy micro-benchmarks but then discover that it’s faster for larger matrices.

Same as above, 81000 rows (and 3 columns) in the full array, with 4384 indexes that I want to get. That’s also the size of the numpy array I tested it against. I basically tested three Nd4j versions:

  1. INDArray out = vals.get(NDArrayIndex.indices(inds));
  2. INDArray out = vals.getRows(inds);
  3. The for-loop

As I said, all three are significantly slower than the numpy implementation. Unsurprisingly, if I reduce the number of indexes I want, speed improves for all three methods. Which of the three is fastest does change a bit, but never by a huge amount compared to the other two:

4384 Indices:
vals.get() - 257 ms
vals.getRows() - 245 ms
For-loop - 238 ms

7684 Indices:
vals.get() - 426 ms
vals.getRows() - 461 ms
For-loop - 446 ms

1272 Indices:
vals.get() - 83 ms
vals.getRows() - 85 ms
For-loop - 79 ms

(The numbers of indices seem a bit random because in my case I only get them indirectly, based on the values in a different array.)

I also checked with differently sized value arrays. Very generally, the getRows() method seems to do worse for larger value arrays, but the results are more or less the same, with all methods in the same ballpark and the numpy implementation always much faster.
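
For anyone who wants to reproduce this kind of comparison, here is a rough, self-contained sketch of timing the three variants (random values, a placeholder index pattern, and a single untimed warm-up pass; not my exact setup, and it assumes getRows takes int indices):

import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;
import org.nd4j.linalg.indexing.NDArrayIndex;

public class GatherBenchmarkSketch {
    public static void main(String[] args) {
        INDArray vals = Nd4j.rand(81000, 3);

        // 4384 row indices spread over the array (placeholder pattern).
        int n = 4384;
        long[] indsLong = new long[n];
        int[] indsInt = new int[n];
        for (int i = 0; i < n; i++) {
            indsLong[i] = (long) i * 18;
            indsInt[i] = (int) indsLong[i];
        }

        // Untimed warm-up pass so JIT compilation and off-heap setup
        // don't dominate the measured run.
        runAll(vals, indsLong, indsInt);

        long t0 = System.nanoTime();
        vals.get(NDArrayIndex.indices(indsLong));
        long t1 = System.nanoTime();
        vals.getRows(indsInt);
        long t2 = System.nanoTime();
        INDArray out = Nd4j.zeros(n, 3);
        for (int i = 0; i < n; i++) {
            out.putRow(i, vals.get(NDArrayIndex.point(indsLong[i])));
        }
        long t3 = System.nanoTime();

        System.out.println("vals.get(indices) : " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("vals.getRows      : " + (t2 - t1) / 1_000_000 + " ms");
        System.out.println("for-loop          : " + (t3 - t2) / 1_000_000 + " ms");
    }

    private static void runAll(INDArray vals, long[] indsLong, int[] indsInt) {
        vals.get(NDArrayIndex.indices(indsLong));
        vals.getRows(indsInt);
        INDArray warm = Nd4j.zeros(indsLong.length, 3);
        for (int i = 0; i < indsLong.length; i++) {
            warm.putRow(i, vals.get(NDArrayIndex.point(indsLong[i])));
        }
    }
}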