BertIterator produces NPE while training on GPU

Hi there

I’m trying to fine-tune a pre-trained BERT model in DL4J using SameDiff and the BertIterator. I run into the following NullPointerException:

Exception in thread "AMDSI prefetch thread" java.lang.RuntimeException: java.lang.NullPointerException
        at org.nd4j.linalg.dataset.AsyncMultiDataSetIterator$AsyncPrefetchThread.run(AsyncMultiDataSetIterator.java:389)
Caused by: java.lang.NullPointerException
        at org.nd4j.jita.concurrency.CudaAffinityManager.ensureLocation(CudaAffinityManager.java:307)
        at org.nd4j.linalg.dataset.callbacks.DefaultCallback.call(DefaultCallback.java:65)
        at org.nd4j.linalg.dataset.AsyncMultiDataSetIterator$AsyncPrefetchThread.run(AsyncMultiDataSetIterator.java:364)

The specific line of code in CudaAffinityManager::ensureLocation looks like the following:

@Override
public void ensureLocation(INDArray array, Location location) {
    // to location to ensure for empty array
    if (array.isEmpty() || array.isS()) // HERE THE NPE HAPPENS
        return;
    ...

When the NPE occurs, the array is null. The array being checked here is the feature mask of the segment ID array. When I run the same code on CPU, the problem does not arise, since CpuAffinityManager does not override the no-op implementation inherited from BasicAffinityManager.

The feature mask of the segment index array is null because of this code snippet in the BertIterator:

private Pair<INDArray[], INDArray[]> convertMiniBatchFeatures(List<Pair<List<String>, String>> tokensAndLabelList, int outLength, long[] segIdOnesFrom) {
    // some code
    ...

    if (featureArrays == FeatureArrays.INDICES_MASK_SEGMENTID) {
        outSegmentIdArr = Nd4j.createFromArray(outSegmentId);
        f = new INDArray[]{outIdxsArr, outSegmentIdArr};
        fm = new INDArray[]{outMaskArr, null}; // HERE THE MAGIC HAPPENS...
    } else {
        f = new INDArray[]{outIdxsArr};
        fm = new INDArray[]{outMaskArr};
    }
    return new Pair<>(f, fm);
}

What might be the reason / fix for this problem?

  1. Should the segment index array be masked like the token index array?
  2. Should CudaAffinityManager have an additional “array == null” check?
  3. Some other reason?

Thank you very much!

Cheers Nino

I think this might be a proper bug. I think it should check for null either before checking the location or inside ensureLocation.

@raver119 does ensureLocation expect array to always be non-null?
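To make the second option concrete, here is a minimal, self-contained sketch of a null guard placed at the top of an ensureLocation-style method. ArrayLike and NullSafeAffinity are hypothetical stand-ins invented for illustration; they are not the ND4J types, and this is not the actual library fix.

```java
// Hypothetical stand-in for an array type; NOT the ND4J INDArray interface.
interface ArrayLike {
    boolean isEmpty();
}

// Hypothetical stand-in for an affinity manager; NOT CudaAffinityManager.
final class NullSafeAffinity {
    private int relocations = 0;

    // Same shape as the quoted ensureLocation, with a null check added first.
    void ensureLocation(ArrayLike array) {
        // The null test short-circuits, so isEmpty() is never called on null
        // (for example, on a missing feature mask).
        if (array == null || array.isEmpty()) {
            return;
        }
        relocations++; // placeholder for the real device relocation work
    }

    int relocations() {
        return relocations;
    }
}
```

With this guard, a null mask entry is skipped the same way an empty array is, instead of throwing in the prefetch thread.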

Thank you for the response.

As a quick fix, I’ve subclassed BertIterator and overridden next(int) to give outSegmentIdArr a mask as well (the code below reuses the token mask rather than building a separate “ones” mask).

  @Override
  public MultiDataSet next(int num) {
    MultiDataSet next = super.next(num);

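    // The mask slot for the segment id array (index 1) is null by default.
    // Reuse the token mask (index 0) so no mask entry is null on the GPU path.
    INDArray[] featuresMaskArrays = next.getFeaturesMaskArrays();
    featuresMaskArrays[1] = featuresMaskArrays[0];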

    return next;
  }

Might not be the final solution but it does the trick for now.
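One caveat with the override above: it writes to index 1 unconditionally, so it would throw an ArrayIndexOutOfBoundsException if the iterator is configured without segment IDs and only returns one mask. A slightly more defensive variant could fill only the slots that are actually null. The sketch below is generic and uses a hypothetical helper name (fillNullMasks); it is an illustration, not part of DL4J.

```java
// Hypothetical helper for illustration; not part of DL4J or ND4J.
final class MaskFill {
    private MaskFill() {
    }

    // Replaces every null entry in masks with the fallback value and
    // leaves non-null entries untouched. Returns how many slots were filled.
    static <T> int fillNullMasks(T[] masks, T fallback) {
        int filled = 0;
        for (int i = 0; i < masks.length; i++) {
            if (masks[i] == null) {
                masks[i] = fallback;
                filled++;
            }
        }
        return filled;
    }
}
```

Inside the overridden next(int), this would be called as MaskFill.fillNullMasks(featuresMaskArrays, featuresMaskArrays[0]), assuming the token mask at index 0 is itself non-null.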

@raver119 What do you think about this problem? Would the null check be a permanent solution?