What does it take to import CLIP into Java?

Hi,

I would like to convert the CLIP (Contrastive Language-Image Pre-Training) model so that it can be used in a Java program for zero-shot image classification. I'm not as familiar with the DL4J ecosystem as I am with Python and PyTorch, so this is also a question about which tools I should look into. I do have some Java experience.

I wonder if I can convert the model to ONNX as an intermediate format and then import it into DL4J. I've heard there is no existing way to import a transformer model, and since CLIP uses a vision transformer, I'm wondering whether that makes this difficult; as long as it is not impossible or prohibitively time-consuming, I am willing to try. Does anyone have experience with this kind of thing, or can you tell me what steps I should take to achieve this goal? Thank you in advance!
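To make the goal concrete: once the image encoder is in ONNX form, here is roughly what I imagine the Java side looking like. This is only a sketch using ONNX Runtime's official Java API (`ai.onnxruntime`) rather than DL4J, and the file name `clip_visual.onnx` and the input name `"image"` are placeholders for whatever the export actually produces:

```java
import ai.onnxruntime.OnnxTensor;
import ai.onnxruntime.OrtEnvironment;
import ai.onnxruntime.OrtSession;

import java.nio.FloatBuffer;
import java.util.Map;

public class ClipImageEncoder {
    public static void main(String[] args) throws Exception {
        OrtEnvironment env = OrtEnvironment.getEnvironment();
        // "clip_visual.onnx" is a placeholder name for the exported image encoder
        try (OrtSession session = env.createSession("clip_visual.onnx",
                new OrtSession.SessionOptions())) {
            // ViT-L/14 expects a normalized 1x3x224x224 image tensor
            float[] pixels = new float[3 * 224 * 224]; // fill from a preprocessed image
            OnnxTensor input = OnnxTensor.createTensor(
                    env, FloatBuffer.wrap(pixels), new long[]{1, 3, 224, 224});
            // "image" must match the input name chosen at export time
            try (OrtSession.Result result = session.run(Map.of("image", input))) {
                float[][] embedding = (float[][]) result.get(0).getValue();
                System.out.println("Embedding length: " + embedding[0].length); // 768 for ViT-L/14
            }
        }
    }
}
```

If DL4J can import the ONNX graph directly that would be even better, but this seems like a workable fallback.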

@kper22020 can you point to a model you want to import? I can take it for a spin and see what's needed. Either TF, Keras, or ONNX format is fine.

Thank you for the reply! I am referring to the model in the CLIP GitHub repository: https://github.com/openai/CLIP

Specifically, the model and preprocessors corresponding to the ViT-L/14 checkpoint are what I'm hoping to import.
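One part I think I can reproduce by hand is the preprocessing, since CLIP's normalization constants are published in the repository. A sketch in plain Java (note that CLIP's torchvision pipeline uses bicubic resizing, while this uses Java's bilinear interpolation, so pixel values will differ slightly):

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

public class ClipPreprocess {
    // CLIP's published normalization constants
    static final float[] MEAN = {0.48145466f, 0.4578275f, 0.40821073f};
    static final float[] STD  = {0.26862954f, 0.26130258f, 0.27577711f};

    // Resize + center-crop to 224x224, then normalize into CHW layout
    static float[] toTensor(BufferedImage img) {
        int size = 224;
        // Scale the short side to 224, then center-crop
        float scale = size / (float) Math.min(img.getWidth(), img.getHeight());
        int w = Math.round(img.getWidth() * scale), h = Math.round(img.getHeight() * scale);
        BufferedImage scaled = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = scaled.createGraphics();
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                RenderingHints.VALUE_INTERPOLATION_BILINEAR);
        g.drawImage(img, 0, 0, w, h, null);
        g.dispose();
        int x0 = (w - size) / 2, y0 = (h - size) / 2;
        float[] chw = new float[3 * size * size];
        for (int y = 0; y < size; y++) {
            for (int x = 0; x < size; x++) {
                int rgb = scaled.getRGB(x0 + x, y0 + y);
                float r = ((rgb >> 16) & 0xFF) / 255f;
                float gr = ((rgb >> 8) & 0xFF) / 255f;
                float b = (rgb & 0xFF) / 255f;
                int i = y * size + x;
                chw[i] = (r - MEAN[0]) / STD[0];
                chw[size * size + i] = (gr - MEAN[1]) / STD[1];
                chw[2 * size * size + i] = (b - MEAN[2]) / STD[2];
            }
        }
        return chw;
    }
}
```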

@kper22020 can you point me to an ONNX model? There's an example of how to do this with just PyTorch here:

There’s also: Export to ONNX

note that you’ll also need to port the tokenizer, which isn’t directly importable that way

@agibsonccc @treo Sorry, I haven't gotten that far yet. I'll check out how to export the model and the tokenizer in ONNX format, then.
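At least the final scoring step for zero-shot classification should be easy in plain Java once I have both embeddings: just cosine similarity between the image embedding and each class-prompt embedding. (CLIP also scales the logits before the softmax, but that doesn't change which class scores highest.)

```java
public class ZeroShot {
    // Cosine similarity between two embeddings
    static double cosine(float[] a, float[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Pick the class whose text embedding is most similar to the image embedding
    static int classify(float[] imageEmb, float[][] textEmbs) {
        int best = 0;
        for (int i = 1; i < textEmbs.length; i++) {
            if (cosine(imageEmb, textEmbs[i]) > cosine(imageEmb, textEmbs[best])) best = i;
        }
        return best;
    }
}
```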