Importing a PyTorch model

Good day. We are interested in importing a PyTorch model (specifically, a Detectron2 Mask R-CNN model) into DL4J.

I have found the documentation on importing a Keras model, and I have just come across SameDiff, which purports to be able to “import pre-existing models from other frameworks”. However, I cannot find any examples for PyTorch (maybe I need to look deeper), so it is not clear to me whether it is possible to load our Detectron2 model into DL4J.

Any words of wisdom will be greatly appreciated. Thanks in advance.

samediff-import-onnx can help

@mdebeer export your model to ONNX and I would be happy to take a look at your case. I wrote a new guide on the core workflow to follow: The core workflow - Deeplearning4j

A basic explanation can be found here:

One issue we noticed that will be fixed in the next release (and soon snapshots) is needing to include the resources defined here:

in your project. We found that pbtxt resources were accidentally excluded when we built the import module.

Thank you for the quick feedback and references. I’ll go through the guides and get back to you.

Aah, I first had to navigate a variety of Python errors to get the ONNX export working.

So, the problems were specific to Detectron2, and I’ll describe them here to help future Googlers:
In the Python package, the onnx.optimizer module has been removed from onnx. It is a dependency for detectron2 (0.5) and torch (1.9.0), and rolling back to onnx==1.8.1, which still includes the optimizers, does not solve the problem.

I posted an example of Detectron2 ONNX model export code here, discussing how I solved the onnx.optimizer error.
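For future readers, the general shape of the workaround is a module alias: install the standalone `onnxoptimizer` package and register it under the old `onnx.optimizer` name before Detectron2's export code imports it. This is a hedged sketch, not the exact code from the linked post; a no-op stand-in module is used below so the snippet runs without onnx or onnxoptimizer installed.

```python
import importlib
import sys
import types

# Real-world workaround (assumes `pip install onnxoptimizer`):
#
#   import onnxoptimizer
#   sys.modules["onnx.optimizer"] = onnxoptimizer
#
# Stand-in shim shown here so the sketch is self-contained:
shim = types.ModuleType("onnx.optimizer")
shim.optimize = lambda model, passes=None: model  # pass-through stand-in
sys.modules["onnx.optimizer"] = shim

# Any later import of "onnx.optimizer" now resolves to the shim:
optimizer = importlib.import_module("onnx.optimizer")
print(optimizer.optimize("graph"))  # -> graph
```

The key point is that the alias must be in place before the first `import onnx.optimizer` runs, since Python caches modules in `sys.modules`.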

Before solving that, I was first getting a nondescript CUDA error (“device-side assert triggered”), which I could circumvent by setting the device to ‘cpu’ after training on GPU. Presumably an onnx / CUDA 11.2 incompatibility.

Next it was a hassle finding the right dummy input type. Sample errors include basic type errors like “string indices must be integers” and “too many indices for tensor of dimension 2”, because I was trying to pass in a numpy image, which first had to be converted to a torch Tensor. An example of how is given in the link above.
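For reference, the conversion the errors were pointing at is HWC uint8 (the numpy/OpenCV layout) to CHW float (the torch layout). A minimal sketch in numpy, with the torch equivalent noted in a comment since torch may not be installed everywhere:

```python
import numpy as np

# HWC uint8 image, as read by OpenCV / PIL-to-array:
img = np.zeros((480, 640, 3), dtype=np.uint8)

# CHW float32, the layout a torch model's dummy input expects.
# torch equivalent (assumed installed): torch.as_tensor(img).permute(2, 0, 1).float()
chw = img.transpose(2, 0, 1).astype(np.float32)

print(chw.shape, chw.dtype)  # -> (3, 480, 640) float32
```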

Anyway, it turned out to be quite an onion of Python errors to peel away first, but I’ve finally managed to export the Detectron2 model to the ONNX format. Now to test importing into DL4J…

@mdebeer sorry to hear that. Thanks for specifying the docs there. I would recommend the desktop app netron for analyzing graphs.

If you have any other issues like that, please let me know. Your pain here tells me we should be writing an export guide. Could you elaborate on anything else you ran into?

Thank you @agibsonccc - I hadn’t heard of Netron before and will start using it.

An export guide would definitely be handy, but I can see how tricky that will be given the number of frameworks and platforms to cover. TensorFlow and PyTorch probably have decent documentation for it already; the Detectron2 guide was not clear to me at all.

I’m currently trying to instantiate new OnnxFrameworkImporter() from org.nd4j.samediff.frameworkimport.onnx.importer, and I run into the following exception:

org.nd4j.samediff.frameworkimport.opdefs.OpDescriptorLoader: Provider org.nd4j.samediff.frameworkimport.onnx.opdefs.OnnxOpDescriptorLoader could not be instantiated
...
Caused by: java.io.FileNotFoundException: nd4j-op-def.pbtxt cannot be opened because it does not exist

I can’t seem to find any references to it online, besides this issue linking to the .pbtxt file. Isn’t that file supposed to be in the class resources, or must I manually include it somehow?

I’m using sbt with:

"org.deeplearning4j" % "deeplearning4j-core" % dl4jVersion,
"org.deeplearning4j" % "deeplearning4j-modelimport" % dl4jVersion,
"org.nd4j" % "samediff-import-onnx" % dl4jVersion,
"org.nd4j" % "nd4j-cuda-11.2" % dl4jVersion,
"org.nd4j" % "nd4j-native-platform" % dl4jVersion

among others, with version = 1.0.0-M1.1

Aah, I now see in your initial response, you mentioned:

there was an exclusion of pbtxt resources we missed when building the import module.

and that it will be fixed in the next release.
Fortunately, I am on holiday for 2 weeks after tomorrow, so, happy to be patient. :grinning_face_with_smiling_eyes:

@mdebeer I would recommend following my advice and just putting the relevant bits in your src/main/resources. It only takes a second and will allow us to potentially fix issues with your model before the release.

Thank you, I figured it might be as simple as adding the files to resources.

I can now instantiate the importer with new OnnxFrameworkImporter(), and loading the graph with .runImport prints a long string with what I presume are all the model parameters:

[_wrapped_model.backbone.fpn_lateral2.weight, _wrapped_model.backbone.fpn_lateral2.bias, _wrapped_model.backbone.fpn_output2.weight, _wrapped_model.backbone.fpn_output2.bias, _wrapped_model.backbone.fpn_lateral3.weight, _wrapped_model.backbone.fpn_lateral3.bias, _wrapped_model.backbone.fpn_output3.weight, _wrapped_model.backbone.fpn_output3.bias, _wrapped_model.backbone.fpn_lateral4.weight, _wrapped_model.backbone.fpn_lateral4.bias, _wrapped_model.backbone.fpn_output4.weight, _wrapped_model.backbone.fpn_output4.bias, _wrapped_model.backbone.fpn_lateral5.weight, _wrapped_model.backbone.fpn_lateral5.bias, _wrapped_model.backbone.fpn_output5.weight, _wrapped_model.backbone.fpn_output5.bias, _wrapped_model.proposal_generator.rpn_head.conv.weight, _wrapped_model.proposal_generator.rpn_head.conv.bias, _wrapped_model.proposal_generator.rpn_head.objectness_logits.weight, _wrapped_model.proposal_generator.rpn_head.objectness_logits.bias, _wrapped_model.proposal_generator.rpn_head.anchor_deltas.weight, _wrapped_model.proposal_generator.rpn_head.anchor_deltas.bias, _wrapped_model.roi_heads.box_head.fc1.weight, _wrapped_model.roi_heads.box_head.fc1.bias, _wrapped_model.roi_heads.box_head.fc2.weight, _wrapped_model.roi_heads.box_head.fc2.bias, _wrapped_model.roi_heads.box_predictor.cls_score.weight, _wrapped_model.roi_heads.box_predictor.cls_score.bias, _wrapped_model.roi_heads.box_predictor.bbox_pred.weight, _wrapped_model.roi_heads.box_predictor.bbox_pred.bias, _wrapped_model.roi_heads.mask_head.mask_fcn1.weight, _wrapped_model.roi_heads.mask_head.mask_fcn1.bias, _wrapped_model.roi_heads.mask_head.mask_fcn2.weight, _wrapped_model.roi_heads.mask_head.mask_fcn2.bias, _wrapped_model.roi_heads.mask_head.mask_fcn3.weight, _wrapped_model.roi_heads.mask_head.mask_fcn3.bias, _wrapped_model.roi_heads.mask_head.mask_fcn4.weight, _wrapped_model.roi_heads.mask_head.mask_fcn4.bias, _wrapped_model.roi_heads.mask_head.deconv.weight, _wrapped_model.roi_heads.mask_head.deconv.bias, 
_wrapped_model.roi_heads.mask_head.predictor.weight, _wrapped_model.roi_heads.mask_head.predictor.bias, 1020, 1021, 1023, 1024, 1026, 1027, 1029, 1030, 1032, 1033, 1035, 1036, 1038, 1039, 1041, 1042, 1044, 1045, 1047, 1048, 1050, 1051, 1053, 1054, 1056, 1057, 1059, 1060, 1062, 1063, 1065, 1066, 1068, 1069, 1071, 1072, 1074, 1075, 1077, 1078, 1080, 1081, 1083, 1084, 1086, 1087, 1089, 1090, 1092, 1093, 1095, 1096, 1098, 1099, 1101, 1102, 1104, 1105, 1107, 1108, 1110, 1111, 1113, 1114, 1116, 1117, 1119, 1120, 1122, 1123, 1125, 1126, 1128, 1129, 1131, 1132, 1134, 1135, 1137, 1138, 1140, 1141, 1143, 1144, 1146, 1147, 1149, 1150, 1152, 1153, 1155, 1156, 1158, 1159, 1161, 1162, 1164, 1165, 1167, 1168, 1170, 1171, 1173, 1174, 1176, 1177, 1179, 1180, 1182, 1183, 1185, 1186, 1188, 1189, 1191, 1192, 1194, 1195, 1197, 1198, 1200, 1201, 1203, 1204, 1206, 1207, 1209, 1210, 1212, 1213, 1215, 1216, 1218, 1219, 1221, 1222, 1224, 1225, 1227, 1228, 1230, 1231, 1233, 1234, 1236, 1237, 1239, 1240, 1242, 1243, 1245, 1246, 1248, 1249, 1251, 1252, 1254, 1255, 1257, 1258, 1260, 1261, 1263, 1264, 1266, 1267, 1269, 1270, 1272, 1273, 1275, 1276, 1278, 1279, 1281, 1282, 1284, 1285, 1287, 1288, 1290, 1291, 1293, 1294, 1296, 1297, 1299, 1300, 1302, 1303, 1305, 1306, 1308, 1309, 1311, 1312, 1314, 1315, 1317, 1318, 1320, 1321, 1323, 1324, 1326, 1327, 1329, 1330]

but then throws exception:

Exception in thread "main" java.lang.NullPointerException
	at org.nd4j.samediff.frameworkimport.onnx.ir.OnnxIRGraph.nodeList(OnnxIRGraph.kt:123)
	at org.nd4j.samediff.frameworkimport.onnx.ir.OnnxIRGraph.<init>(OnnxIRGraph.kt:88)
	at org.nd4j.samediff.frameworkimport.onnx.importer.OnnxFrameworkImporter.runImport(OnnxFrameworkImporter.kt:49)
	at TestImportONNX$.delayedEndpoint$TestImportONNX$1(TestImportONNX.scala:13)
...

presumably because I am providing Collections.emptyMap() as the input to the .runImport call, as shown in the model-import-framework guide.

I haven’t delved this deep into DL4J before. Could you please advise on how to supply image inputs to the model?

@mdebeer much appreciated, and I am glad you had time to give some feedback. Could you give me a copy of the ONNX model? I can see if there is potentially a bug to fix. I’ve mainly tested on mainstream CNN architectures and have not seen this before. Either way, a better error message is in order there.

From the looks of it, the error is here: deeplearning4j/OnnxIRGraph.kt at master · eclipse/deeplearning4j · GitHub

Looking at that, it means there is a null name in the input. Ideally, all graphs should have inputs or placeholders. There may be additional manipulation we need to do to the model itself.

That would make sense if the input names are missing… I noticed that the torch model exporter takes arguments for specifying names for the input/output nodes.
For interest, here’s the torch.onnx.export definition:

def export(model, args, f, export_params=True, verbose=False, training=TrainingMode.EVAL,
           input_names=None, output_names=None, aten=False, export_raw_ir=False,
           operator_export_type=None, opset_version=None, _retain_param_name=True,
           do_constant_folding=True, example_outputs=None, strip_doc_string=True,
           dynamic_axes=None, keep_initializers_as_inputs=None, custom_opsets=None,
           enable_onnx_checker=True, use_external_data_format=False)

Those name arguments default to None, and the Detectron2 exporter API does not take arguments for those fields. So one possibility is to manually pass node names to the torch exporter and see if that makes a difference for the model import. I’ll see if I can test that quickly.

I’m uploading the ONNX model to Dropbox (430MB) for you to check out – I’ll send a message to you with the share link.

@mdebeer could you send a png of the model rendered in netron or the model itself so I can explore this? Thanks!

Perfect, a png is way better. The model upload was getting stuck.
Image is too large to upload here, so here’s a dropbox link.

It does appear to have a named input node. Here’s closeup of input model properties:

@mdebeer that’s interesting… so there is an input there, but we’re not picking it up. That confirms my worst fears. I know I’m being a pain here, but now I want that model even more. Even if it is a pain, I think it will save us both time to let me just use what you have now. Let me DM you to coordinate a way that works for both of us.

Great, I’ll send it through now. Not a pain at all, I really appreciate the prompt feedback and assistance.

@mdebeer thanks for the model, I’ll implement a workaround. The ONNX export produces an op called AliasWithName, which is not even a valid op in the ONNX spec:

Other people are facing the same issue apparently: "AliasWithName is not a registered function/op" when run converted onnx model · Issue #1868 · facebookresearch/detectron2 · GitHub
I also fixed the opaque error, so this will be more obvious in the future. Thanks!