Multi-GPU: Exception when storing Model using SameDiff

Crispy · November 25, 2021, 1:29pm

Hello

We fine-tune a BERT model based on a TensorFlow frozen graph using SameDiff. This happens on a system with two GPUs.

After the training for the fine-tuning is successfully completed, we try to persist the resulting model using SameDiff.save(modelStream, false);

This fails with an unsupported data type error and the following stack trace:

java.lang.UnsupportedOperationException: Unsupported data type used: UTF8
org.nd4j.linalg.factory.Nd4j.scalar(Nd4j.java:4977)
org.nd4j.linalg.factory.Nd4j.create(Nd4j.java:4279)
org.nd4j.linalg.util.DeviceLocalNDArray.get(DeviceLocalNDArray.java:68)
org.nd4j.autodiff.samediff.array.ThreadSafeArrayHolder.getArray(ThreadSafeArrayHolder.java:53)
org.nd4j.autodiff.samediff.SameDiff.getArrForVarName(SameDiff.java:742)
org.nd4j.autodiff.samediff.SDVariable.getArr(SDVariable.java:138)
org.nd4j.autodiff.samediff.SDVariable.getArr(SDVariable.java:120)
org.nd4j.autodiff.samediff.SameDiff.asFlatBuffers(SameDiff.java:4803)
org.nd4j.autodiff.samediff.SameDiff.asFlatBuffers(SameDiff.java:4773)
org.nd4j.autodiff.samediff.SameDiff.asFlatBuffers(SameDiff.java:4995)
org.nd4j.autodiff.samediff.SameDiff.asFlatFile(SameDiff.java:5109)
org.nd4j.autodiff.samediff.SameDiff.save(SameDiff.java:5009)
org.nd4j.autodiff.samediff.SameDiff.save(SameDiff.java:5029)
com.bsiag.ml.cortex.learn.classification.bert.BertClassifierCortex.modelToStream(BertClassifierCortex.java:333)
com.bsiag.ml.engine.mindset.CortexMindset.exportModel(CortexMindset.java:376)

The cause is that, within org.nd4j.linalg.util.DeviceLocalNDArray.get(DeviceLocalNDArray.java:68), sourceId and deviceId are sometimes unequal: sourceId (to my understanding, the device where the variable's contents reside in memory) is 0, while deviceId (the device on which the current thread runs) is 1. By itself this should not be a problem, because the intent of this method seems to be exactly to get the data onto the device local to the currently running thread.

However, at least one data item is a UTF-8 scalar. It was initially loaded as part of the TensorFlow model (a protocol buffer *.pb file generated with freezeTrainedBert.py from here, on the basis of the “BERT-Base, Multilingual Cased” model downloadable from Google’s GitHub), and it cannot be transferred to the other device in this way because UTF-8 scalars are not supported.
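To make the suspected failure mode concrete, here is a minimal, self-contained sketch of the behaviour described above. It is not ND4J code: the names DeviceLocalSketch, Type, Array and DeviceLocal are hypothetical stand-ins, and the copy logic is only an assumption about what happens when sourceId and deviceId differ.

```java
import java.util.HashMap;
import java.util.Map;

public class DeviceLocalSketch {

    /** Hypothetical stand-in for an array's data type (not ND4J's DataType). */
    enum Type { FLOAT, UTF8 }

    /** Hypothetical stand-in for an array: a type, a payload, and the device holding it. */
    static final class Array {
        final Type type;
        final Object value;
        final int deviceId;

        Array(Type type, Object value, int deviceId) {
            this.type = type;
            this.value = value;
            this.deviceId = deviceId;
        }
    }

    /** Simplified model of a holder that hands out one copy of an array per device. */
    static final class DeviceLocal {
        private final Array source;
        private final Map<Integer, Array> copies = new HashMap<>();

        DeviceLocal(Array source) {
            this.source = source;
            copies.put(source.deviceId, source);
        }

        Array get(int deviceId) {
            Array local = copies.get(deviceId);
            if (local != null) {
                return local; // sourceId == deviceId, or already replicated
            }
            // sourceId != deviceId: a copy has to be created on the requesting device.
            // Assumption: this is the step that has no implementation for string scalars.
            if (source.type == Type.UTF8) {
                throw new UnsupportedOperationException(
                        "Unsupported data type used: " + source.type);
            }
            Array copy = new Array(source.type, source.value, deviceId);
            copies.put(deviceId, copy);
            return copy;
        }
    }

    /** Returns true if fetching the array for the given device throws. */
    static boolean failsOn(DeviceLocal holder, int deviceId) {
        try {
            holder.get(deviceId);
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        DeviceLocal weights = new DeviceLocal(new Array(Type.FLOAT, 0.5f, 0));
        DeviceLocal label = new DeviceLocal(new Array(Type.UTF8, "some string", 0));

        System.out.println(failsOn(weights, 1)); // false: numeric data can be copied
        System.out.println(failsOn(label, 0));   // false: same device, no copy needed
        System.out.println(failsOn(label, 1));   // true: string scalar on another device
    }
}
```

In this model the save only fails when the saving thread happens to be attached to a device other than the one holding the string scalar, which would match the "sometimes unequal" observation.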

Taken by itself, the unsupported data type is just an implementation decision about what to support. In the overall context, however, it looks to me like a design bug in how multiple GPU devices are handled.

I can supply the protobuf file if it helps.

Also, any ideas for a workaround? We tried limiting execution to one GPU by setting ND4J_CUDA_FORCE_SINGLE_GPU=true, but this does not help.

Regards, Crispy

agibsonccc · November 25, 2021, 1:45pm

@Crispy The issue looks simpler than what you’re describing. It appears to be trying to create a scalar of type string. Could you file an issue? I don’t see why we shouldn’t allow that. Beyond that, anything you can give me to help reproduce it (preferably code plus the model, over DMs if needed) would be nice.

Crispy · November 26, 2021, 8:01am

Thanks, @agibsonccc. I’ll create an issue.

Crispy · November 29, 2021, 8:40am

Created here
