ArrayStore removal error during SameDiff training

Hi. While training a SameDiff model I get an exception if I try to increase my batch size:

Exception in thread "main" java.lang.IllegalStateException: Cannot remove array from ArrayStore: no array with this length exists in the cache
	at org.nd4j.common.base.Preconditions.throwStateEx(
	at org.nd4j.common.base.Preconditions.checkState(
	at org.nd4j.autodiff.samediff.internal.memory.ArrayCacheMemoryMgr$ArrayStore.removeObject(
	at org.nd4j.autodiff.samediff.internal.memory.ArrayCacheMemoryMgr$ArrayStore.access$100(
	at org.nd4j.autodiff.samediff.internal.memory.ArrayCacheMemoryMgr.release(
	at org.nd4j.autodiff.samediff.internal.InferenceSession.getOutputs(
	at org.nd4j.autodiff.samediff.internal.TrainingSession.getOutputs(
	at org.nd4j.autodiff.samediff.internal.TrainingSession.getOutputs(
	at org.nd4j.autodiff.samediff.internal.AbstractSession.output(
	at org.nd4j.autodiff.samediff.internal.TrainingSession.trainingIteration(
	at org.nd4j.autodiff.samediff.SameDiff.fitHelper(
	at org.nd4j.autodiff.samediff.config.FitConfig.exec(

My model is essentially a Transformer encoder, and its main component is self-attention (MultiHeadDotProductAttention). With a batch size of 120, a sequence length of 64, and an embedding (hidden) size of 768, everything worked fine. But after increasing the batch size to 140 (or more), or the sequence length to 128 (or more), I get the failure above. My aim is to have at least 256 sequences in a batch and a sequence length of 128. I also have settings in place for frequent GC cleanup, but the error is triggered even without them.
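(The actual GC settings weren't quoted in the post. Purely as an illustration of what such settings usually look like in ND4J, assuming the standard `Nd4j.getMemoryManager()` API:)

```java
import org.nd4j.linalg.factory.Nd4j;

// Illustration only -- not the poster's actual settings.
// ND4J's memory manager exposes a periodic-GC window that forces
// System.gc() at a fixed interval to release off-heap buffers sooner.
Nd4j.getMemoryManager().togglePeriodicGc(true);
Nd4j.getMemoryManager().setAutoGcWindow(500); // invoke GC roughly every 500 ms
```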

I think I found the root cause of those errors.
There are two places in the code that seem to have logic errors:

  1. Line 255 checks for index > 0, but it should check for index >= 0. Arrays.binarySearch() returns 0 when the match is at the first position, and any index from 0 to length - 1 is a valid hit, so the > 0 check wrongly treats a first-slot hit as a miss.
  2. Lines 258-260 (the 'while' loop) also seem to have logic issues. I've replaced lines 256-262 with the following code:
    IntStream.range(0, size)
        .filter(index -> lengths[index] == length && sorted[index] == array)
        .findFirst()
        .ifPresentOrElse(this::removeIdx, () ->
            LOG.warn("An array of type {} with ID {} and length {} hasn't been removed from cache",
                array.dataType(), array.getId(), array.length()));
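To see why the index > 0 check is off by one, here is a minimal standalone illustration (not the ND4J code itself) of the Arrays.binarySearch() return-value contract:

```java
import java.util.Arrays;

public class BinarySearchDemo {
    public static void main(String[] args) {
        // Sorted array of cached lengths, as in the ArrayStore scenario.
        long[] lengths = {64, 128, 768};

        // A hit at the first slot returns index 0, so a check for
        // `index > 0` wrongly classifies this as "not found".
        int hit = Arrays.binarySearch(lengths, 64);
        System.out.println(hit);   // 0

        // A genuine miss returns a negative value encoding the
        // insertion point: -(insertionPoint) - 1.
        int miss = Arrays.binarySearch(lengths, 100);
        System.out.println(miss);  // -2, i.e. insertion point 1
    }
}
```

Only `index >= 0` (or equivalently `index > -1`) correctly distinguishes hits from misses.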

Those changes seem to work. After running a couple of iterations I see no WARN logs, which means all old arrays are being removed from the cache properly.

Should I report a bug?

Yes, SameDiff is still not stable.

@SidneyLann @partarstu thanks for helping here. I’m currently working on finishing our new model import framework first. Please feel free to submit pull requests and I’m happy to merge them if you find issues. Otherwise, I’m happy to take a look.

@agibsonccc I’ll take a look at the contribution specifics for the project and will create a PR after that

sameDiff.loss.softmaxCrossEntropy("loss", label=[1024,6], out=[1024,6], weight=[6])

This call fails with the error below and I have no idea why:

[main] ERROR org.nd4j.linalg.cpu.nativecpu.ops.NativeOpExecutioner - Failed to execute op softmax_cross_entropy_loss. Attempted to execute with 3 inputs, 1 outputs, 1 targs,0 bargs and 1 iargs. Inputs: [(FLOAT,[1024,6],c), (FLOAT,[6],c), (FLOAT,[1024,6],c)]. Outputs: [(FLOAT,c)]. tArgs: [0.0]. iArgs: [3]. bArgs: -. Input var names: [out, labelWeight, label]. Output var names: [loss] - Please see above message (printed out from c++) for a possible cause of error.
java.lang.RuntimeException: NDArray::tile method - shapeInfo of target array is not suitable for tile operation !
at org.nd4j.linalg.cpu.nativecpu.ops.NativeOpExecutioner.exec(
at org.nd4j.linalg.factory.Nd4j.exec(
at org.nd4j.autodiff.samediff.internal.InferenceSession.doExec(
at org.nd4j.autodiff.samediff.internal.InferenceSession.getOutputs(
at org.nd4j.autodiff.samediff.internal.TrainingSession.getOutputs(
at org.nd4j.autodiff.samediff.internal.InferenceSession.getOutputs(
at org.nd4j.autodiff.samediff.internal.AbstractSession.output(
at org.nd4j.autodiff.samediff.internal.TrainingSession.trainingIteration(
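For anyone debugging the shapes here: with logits out=[1024,6], one-hot label=[1024,6], and per-class weight=[6], the op should reduce to a weighted mean of per-row cross-entropies. A plain-Java sketch of that math (an illustration for checking expected values on small inputs, not the ND4J implementation, and assuming a simple mean-over-batch reduction):

```java
public class WeightedSoftmaxCE {
    // Weighted softmax cross-entropy over [batch, classes] logits with
    // one-hot labels and per-class weights, mean-reduced over the batch.
    static double loss(double[][] logits, double[][] oneHot, double[] w) {
        double sum = 0.0;
        for (int i = 0; i < logits.length; i++) {
            // Numerically stable log-sum-exp: subtract the row max.
            double max = Double.NEGATIVE_INFINITY;
            for (double v : logits[i]) max = Math.max(max, v);
            double z = 0.0;
            for (double v : logits[i]) z += Math.exp(v - max);
            double logZ = Math.log(z) + max;
            for (int c = 0; c < logits[i].length; c++) {
                // -sum_c weight[c] * label[c] * log(softmax(logits)[c])
                sum += -w[c] * oneHot[i][c] * (logits[i][c] - logZ);
            }
        }
        return sum / logits.length;
    }

    public static void main(String[] args) {
        double[][] logits = {{2.0, 0.5, -1.0}, {0.0, 1.0, 0.0}};
        double[][] oneHot = {{1, 0, 0}, {0, 1, 0}};
        double[] w = {1.0, 1.0, 1.0};
        System.out.println(loss(logits, oneHot, w));
    }
}
```

Note that the weight vector here only scales scalar terms, so no tiling is involved; the tile failure in the log suggests the op is trying to broadcast weight=[6] against [1024,6] internally.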

@SidneyLann that’s coming from the C++ side. That error is here:

Could you give me more context? I appreciate all the issues you’re finding; I’ll take a look at those during release QA. Currently I’m still finishing up WIP: Add new model import framework by agibsonccc · Pull Request #554 · KonduitAI/deeplearning4j · GitHub,
which will add a proper set of op descriptors in protobuf, similar to TF/ONNX, as well as allow defining custom rules for model import, so model import is no longer a black box.

@agibsonccc, I created a PR for this issue. I didn’t test it by running the unit tests because I gave up after about two hours of struggling with the environment setup. However, I used those changes in my own overridden version of ArrayCacheMemoryMgr and it works.

Link to PR: Fixed object's removal in ArrayCacheMemoryMgr by partarstu · Pull Request #9155 · eclipse/deeplearning4j · GitHub

@partarstu thanks, looks good. I merged your PR.

Thank you @agibsonccc