Importing GRU onnx model failed

ccen · February 20, 2023, 3:23am

Hello all, I’m trying to import a PyTorch-generated GRU ONNX model into dl4j and got errors.
I will show the steps I took and my speculation about the cause.

(1) Generate GRU onnx model with PyTorch
Following the PyTorch documentation for GRU and ONNX export, I wrote a simple script.

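import torch

# 2-layer GRU: input_size=10, hidden_size=20
rnn = torch.nn.GRU(10, 20, 2)
input = torch.randn(5, 3, 10)  # (seq_len, batch, input_size)
h0 = torch.randn(2, 3, 20)     # (num_layers, batch, hidden_size)
torch.onnx.export(rnn, (input, h0), "Single_GRU.onnx")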

(2) Import to dl4j, got import error

// f is the java.io.File for the ONNX file to import (its definition is not shown here)
OnnxFrameworkImporter onnxFrameworkImporter = new OnnxFrameworkImporter();
SameDiff graph = onnxFrameworkImporter.runImport(f.getAbsolutePath(), Collections.emptyMap(), true);

Exception in thread "main" java.lang.IndexOutOfBoundsException: Index 2 out of bounds for length 2
	at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:64)

(3) Modify onnx file, got another error
I found that the IndexOutOfBoundsException occurs while accessing the outputs of the GRU ONNX node, so I wrote a Python script that adds more outputs to the GRU node, which produces a non-standard ONNX file.

import onnx

model = onnx.load("Single_GRU.onnx")

# Insert placeholder output names at indices 2 and 3 of every GRU node.
# The resulting file no longer follows the ONNX standard.
for node in model.graph.node:
    if node.op_type == "GRU":
        for j in range(2, 4):
            node.output.insert(j, node.name + "_output_placeholder_" + str(j))
        print(node)
        print("=====")

onnx.save(model, "Single_GRU_addoutput.onnx")
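
A quick way to see what the script above changed is to list the output names of each GRU node before and after the edit. The helper below is only a minimal sketch: `gru_node_outputs` is a hypothetical name, and it relies solely on the `graph.node`, `op_type`, `name`, and `output` fields already used in the script.

```python
def gru_node_outputs(model):
    """List (node name, output names) for every GRU node in the graph.

    `model` is expected to be an ONNX ModelProto (for example, the
    result of onnx.load), but any object exposing graph.node works.
    """
    return [
        (node.name, list(node.output))
        for node in model.graph.node
        if node.op_type == "GRU"
    ]


# Usage, assuming the onnx package and the two files from this post:
#   import onnx
#   print(gru_node_outputs(onnx.load("Single_GRU.onnx")))
#   print(gru_node_outputs(onnx.load("Single_GRU_addoutput.onnx")))
```

Comparing the two printed lists shows how many outputs each GRU node declares in the original file and in the modified one.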

Then I got another error.

Error at [D:/a/deeplearning4j/deeplearning4j/libnd4j/include/ops/declarable/generic/nn/recurrent/gruCell.cpp:97:0]:
gruCell: Input ranks must be 2 for inputs 0 and 1 (x, hLast) - got 3, 3
Exception in thread "main" java.lang.RuntimeException: Op gruCell with name /GRU failed to execute. Here is the error from c++: Op validation failed
	at org.nd4j.linalg.cpu.nativecpu.ops.NativeOpExecutioner.calculateOutputShape(NativeOpExecutioner.java:1486)

agibsonccc · February 20, 2023, 3:57am

@ccen what PyTorch version are you using? Try not to generate random one-off files. If it doesn’t work out of the box, then we should fix it. I’ve sometimes found that the opset version matters; see our example here:

github.com

deeplearning4j/deeplearning4j/blob/ab43b47a19638f19edf93f39bea50ce86c3f2b60/contrib/omnihub/src/omnihub/frameworks/pytorch.py#L75

def download_model(self, model_path, **kwargs) -> str:
    model = None
    height = kwargs.get('height', MODEL_DEFAULTS[model_path]['height'])
    width = kwargs.get('width', MODEL_DEFAULTS[model_path]['width'])
    x = torch.from_numpy(np.ones((1, 3, height, width), dtype=np.float32))
    if model_path in detection_models:
        model = models.detection[model_path](pretrained=True, **kwargs)
    else:
        model = models.__dict__[model_path](pretrained=True, **kwargs)
    torch.onnx.export(model,
                      x,
                      f'{framework_dir}/{model_path}.onnx',
                      export_params=True,
                      do_constant_folding=False,
                      opset_version=13,
                      **kwargs)


def stage_model(self, model_path: str, model_name: str):
    super().stage_model(model_path, model_name)

If you still have issues after updating your call please do file an issue. Whatever it is should be a quick fix. Thanks!

agibsonccc · February 20, 2023, 4:01am

@ccen also of note…GRU was one of the first ones we implemented: deeplearning4j/OnnxOpDeclarations.kt at e5218991026880a4b1ae09dff0750f149469f80c · deeplearning4j/deeplearning4j · GitHub

This should have been a fairly straightforward import mapping 1 to 1. The version can really matter here.

ccen · February 20, 2023, 4:45am

As a new user, I can only put 2 links in a post. So my message is divided into segments.

(4) My speculation about the error
The error message comes from gruCell.cpp#L97.

In the dl4j 1.0.0-beta7 documentation, I found that dl4j has both gru and gruCell classes. gruCell performs a single time step, so it requires rank-2 inputs.
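
The rank difference can be illustrated with shapes alone. The sketch below is not dl4j code: `gru_cell_shape` and `gru_sequence_shape` are hypothetical helpers that only track tensor shapes. They assume the layout from the ONNX GRU specification (X is [seq_length, batch_size, input_size], initial_h is [num_directions, batch_size, hidden_size]) and that each PyTorch layer is exported as its own single-layer GRU node.

```python
def gru_cell_shape(x_shape, h_shape):
    """Shape of the new hidden state after ONE time step.

    A cell sees a single step, so both inputs are rank 2:
    x is (batch, input_size) and h is (batch, hidden_size).
    """
    if len(x_shape) != 2 or len(h_shape) != 2:
        raise ValueError(
            f"cell inputs must be rank 2, got {len(x_shape)}, {len(h_shape)}"
        )
    if x_shape[0] != h_shape[0]:
        raise ValueError("batch sizes of x and h differ")
    return tuple(h_shape)


def gru_sequence_shape(x_shape, h0_shape):
    """Shapes of (Y, Y_h) for a whole sequence in the ONNX GRU layout.

    x is (seq_length, batch, input_size) and h0 is
    (num_directions, batch, hidden_size); both are rank 3.
    """
    seq_length, batch, input_size = x_shape
    num_directions, _, hidden_size = h0_shape
    h = (batch, hidden_size)
    # The sequence op loops over time and applies the cell at each step.
    for _ in range(seq_length):
        h = gru_cell_shape((batch, input_size), h)
    y = (seq_length, num_directions, batch, hidden_size)
    y_h = (num_directions, batch, hidden_size)
    return y, y_h


# One layer of the model from step (1): 5 steps, batch 3, 10 -> 20 features.
print(gru_sequence_shape((5, 3, 10), (1, 3, 20)))
# prints ((5, 1, 3, 20), (1, 3, 20))
```

Passing the rank-3 shapes straight to the cell helper, as in `gru_cell_shape((5, 3, 10), (1, 3, 20))`, raises an error analogous to the "got 3, 3" message from gruCell.cpp.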

ccen · February 20, 2023, 4:46am

My speculation is that dl4j applies the gruCell operator to the ONNX GRU node rather than the gru operator, which causes the errors in (2) and (3). One clue is that in nd4j the ONNX name of gruCell is “GRU”.

ccen · February 20, 2023, 4:51am

My PyTorch version is 1.13.1 and my dl4j version is 1.0.0-SNAPSHOT (specifically nd4j-cpu-backend-common-1.0.0-20230209.002746-392.jar).

Yes, the opset version often matters, but not in this case. I’ve tried different opset versions; the script above is just a brief introduction. I’ve filed an issue on GitHub.
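
To double-check which opset an exported file actually declares, the model's `opset_import` field can be read back. This is a minimal sketch in which `declared_opsets` is a hypothetical helper name; an empty domain is reported as `ai.onnx`, the default ONNX operator set.

```python
def declared_opsets(model):
    """Return {domain: version} from an ONNX ModelProto's opset_import."""
    return {
        (entry.domain or "ai.onnx"): entry.version
        for entry in model.opset_import
    }


# Usage, assuming the onnx package is installed:
#   import onnx
#   print(declared_opsets(onnx.load("Single_GRU.onnx")))
```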

ccen · February 20, 2023, 4:54am

I don’t think gruCell should map to the ONNX GRU node, since “gruCell does a single time step operation”.
The ONNX GRU operator is defined here.

agibsonccc · February 20, 2023, 5:31am

@ccen we have a gru layer as well…let me revisit this.
