I would like to convert the CLIP (Contrastive Language-Image Pre-Training) model so that it can be used in a Java program for zero-shot image classification. I am not as familiar with the DL4J ecosystem as I am with Python and PyTorch, so this is also a question about which tools I should look into. I have some Java experience.
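For context, here is a rough sketch of the scoring step I expect to need on the Java side once the image and text encoders run there. It assumes the encoders already return embeddings as float arrays. The class name `ClipZeroShot`, the method names, and the logit scale of 100.0 are my own illustrative assumptions and are not taken from any library:

```java
import java.util.Arrays;

// Hypothetical helper: turns one image embedding and N label-prompt
// embeddings into N probabilities (cosine similarity followed by softmax).
final class ClipZeroShot {

    // L2-normalize a vector so that a dot product equals cosine similarity.
    static double[] normalize(float[] v) {
        double norm = 0.0;
        for (float x : v) {
            norm += (double) x * x;
        }
        norm = Math.sqrt(norm);
        if (norm == 0.0) {
            throw new IllegalArgumentException("zero-length embedding");
        }
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = v[i] / norm;
        }
        return out;
    }

    // Returns one probability per text embedding, in the same order.
    static double[] classify(float[] imageEmbedding,
                             float[][] textEmbeddings,
                             double logitScale) {
        double[] img = normalize(imageEmbedding);
        double[] logits = new double[textEmbeddings.length];
        double max = Double.NEGATIVE_INFINITY;
        for (int k = 0; k < textEmbeddings.length; k++) {
            if (textEmbeddings[k].length != img.length) {
                throw new IllegalArgumentException("embedding size mismatch");
            }
            double[] txt = normalize(textEmbeddings[k]);
            double dot = 0.0;
            for (int i = 0; i < img.length; i++) {
                dot += img[i] * txt[i];
            }
            logits[k] = logitScale * dot;
            max = Math.max(max, logits[k]);
        }
        // Softmax, subtracting the max logit for numerical stability.
        double sum = 0.0;
        double[] probs = new double[logits.length];
        for (int k = 0; k < logits.length; k++) {
            probs[k] = Math.exp(logits[k] - max);
            sum += probs[k];
        }
        for (int k = 0; k < probs.length; k++) {
            probs[k] /= sum;
        }
        return probs;
    }

    public static void main(String[] args) {
        // Toy 3-dimensional embeddings, made up purely for illustration.
        float[] image = {0.9f, 0.1f, 0.0f};
        float[][] labels = {{1f, 0f, 0f}, {0f, 1f, 0f}};
        double[] probs = classify(image, labels, 100.0); // 100.0 is an assumed scale
        System.out.println(Arrays.toString(probs));
    }
}
```

This part needs no deep learning framework at all, so the real difficulty is only getting the two encoders to run from Java.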
I wonder whether I can convert the model to ONNX as an intermediate format and then import it into DL4J. I have heard there is no existing way to import a transformer model, and since CLIP uses a vision transformer, I am wondering whether that affects this or makes it difficult. As long as it is not impossible or prohibitively time-consuming, I am willing to try. Does anyone have experience with this kind of thing, or can anyone tell me what steps I can take to achieve this goal? Thank you in advance!