With deep learning progressing so quickly, it’s been ages since the last stable update. I can’t wait for the new stable release.
@cqiaoYc agreed. You can see my working branch if you want.
I push here, publish snapshots from here and split out PRs from that as a second form of review.
There are new LLM modules that load qwen and docling via ggml import, plus other relevant components, on snapshots now. Unfortunately, the reality is that most people use pytorch or llama. If the framework is going to be useful to anyone at this point, it needs to import and run those models as-is, with no changes and equivalent performance.
I’ve just now gotten the infra to a point where I can actually ship something, since ccache finally allows builds to finish. With that in place, shipping will be much easier.
I’ve also had to work on new infra for making the framework competitive in a way that will actually scale. In this case I’ve been spending time on new MLIR-based compiler infra that hooks into modules published by hardware vendors. That will allow less hand-written code underneath while allowing innovation in packaging and ease of use, which is where I believe the framework can have some value.
The other problem, to be brutally honest: people don’t contribute. They don’t pay. I have a few contracts for specific parts of the framework, but beyond that the main use cases are dictated by what my company does. So I’m left with a few questions:
- What will keep the framework relevant to the point that people would care?
- How do I avoid people needing to learn the API?
- What makes commercial sense to develop? In this case, it’s a UI-based product.
Not that I’m expecting anyone to care about the product but I just want to be clear what keeps the framework going.
At this point, if I do a stable release I want it to be well tested. I’ve been able to chip away at problems, but have had to focus my testing on my own use cases rather than on the broader needs people will have, like documentation, usage of the older API, and different backends. Luckily I feel pretty good about this and mainly need to see all tests passing. I plan on chipping away at that next.
Your work is truly remarkable. DL4J offers one of the few production-grade open-source frameworks for deep learning models natively built in Java, and this is precisely where its greatest value lies. How open-source projects can monetize and generate revenue to sustain their development is a challenging problem—especially in the context of deep learning frameworks, where PyTorch holds an absolute dominant position.
In my personal opinion, it will be difficult for DL4J to gain an edge in building large language models (LLMs). Instead, providing high-quality Java wrappers for the best open-source LLMs and developing specialized small-to-medium models tailored to specific domains may be a better strategic choice. Documentation is of paramount importance, yet we can leverage LLMs to automatically generate documentation as much as possible to reduce human workload.
I’ve been following this project since sometime in 2016, and I’ve watched it develop and then stall. It feels like a real shame. As an enthusiast, I first got into deep learning using Python and DL4J. After reading what you said about making it actually useful to others and how it should be used, combined with my own development experience, I have two points:
- It should be able to import models from the Python ecosystem, such as the PyTorch .pt format, similar to the DJL library, and also ONNX models. For this part, you can just wrap it like DJL does and borrow its code and ideas.
- We’re now in the AI era. Going from Python to DL4J involves changes in coding style, but that’s not a big issue. As long as tensor operations are implemented, the glue code can be written easily with AI assistance. The core execution flow of the model just needs to stay consistent.
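The glue-code point can be made concrete: once basic tensor operations exist, a model’s forward pass is ordinary Java. Here is a minimal stdlib-only sketch of a dense layer (purely illustrative names, not DL4J’s actual API):

```java
import java.util.Arrays;

public class DenseForward {
    // y = relu(W x + b): the kind of glue code that ports almost directly
    // from a Python model definition once tensor ops are available.
    static float[] dense(float[][] w, float[] x, float[] b) {
        float[] y = new float[w.length];
        for (int i = 0; i < w.length; i++) {
            float acc = b[i];
            for (int j = 0; j < x.length; j++) acc += w[i][j] * x[j];
            y[i] = Math.max(0f, acc); // ReLU activation
        }
        return y;
    }

    public static void main(String[] args) {
        float[][] w = {{1f, 0f}, {0f, -1f}};
        float[] x = {2f, 3f};
        float[] b = {0.5f, 0f};
        System.out.println(Arrays.toString(dense(w, x, b))); // [2.5, 0.0]
    }
}
```

The execution flow (layer by layer, op by op) stays the same across languages; only the surface syntax changes.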
@deepleanring4j I’m looking into doing something in the next few months. For now I’m publishing snapshots to get people rolling. Unfortunately, what’s gone on is a complete rewrite of the internals: slimming down the framework, getting rid of a massive amount of unused features, and focusing the project on fixing some of its fundamental architectural flaws. Let me give you an idea:
- Competition’s out there, and it’s not the same as when dl4j launched: DJL exists, onnxruntime java exists, pytorch via javacpp exists, and Llama’s out there for LLMs.
- The brutal question is “what’s the point of the framework” when you have pytorch? It had a few niches like keras import, but the reality is the framework never was as popular as a lot of the python ones.
- Given that, I decided to take the time to properly update the framework to suit my commercial needs, in a way that lets me serve the pre-existing customer base that actually pays for some of the development, plus my own needs, so it stays a good foundational tool for products I build. I’ve worked on the framework since 2013 and the ecosystem has evolved a lot.
- So the baseline feature set has to be: ZERO switching costs, comparable performance, and a sustainable feature set that matches what people need today.
- So what does that break down to? ggml, safetensors, and full onnx import at a bare minimum.
- Comparable performance: how do we do that without rewriting the c++ code base? Turns out the answer was a new function called DSP, which is essentially onnx-style pre-computation plus deep learning compiler tech using triton and MLIR to run kernels.
- The other issue: large models. We need model sharding to support LLMs; a single flatbuffers file looked great for samediff at the time but doesn’t make sense now.
- Where are we now? The great news is a lot of this is done in some form and I’m actively testing it in my actual products.
- The idea now is that these modules are all out there in some form, and I’m just trying to get to a feature set that I’m happy with and that’s usable in my commercial product. In that time I will also do a release focusing on samediff. The great thing is the pre-existing word vectors, datavec, and other modules will still be supported, as will the old dl4j api thanks to keras import.
- That SHOULD lead to a point where the framework at least becomes usable to people but doesn’t get trapped in the framework wars of yesteryear. I’m not playing a marketshare game anymore. I’m building a tool I want to see exist, keeping it relevant, and making it my take on what the python world has in spades.
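On the safetensors piece above: part of why it’s a reasonable import target is that the format itself is simple, an 8-byte little-endian header length followed by a UTF-8 JSON header describing each tensor’s dtype, shape, and byte offsets. A minimal stdlib-only sketch (illustrative only, not the framework’s actual importer):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class SafetensorsHeader {
    // safetensors layout: bytes 0-7 hold a little-endian u64 header length N;
    // bytes 8..8+N hold a UTF-8 JSON header mapping tensor names to their
    // dtype, shape, and data_offsets into the raw tensor bytes that follow.
    static String readHeader(byte[] file) {
        ByteBuffer buf = ByteBuffer.wrap(file).order(ByteOrder.LITTLE_ENDIAN);
        long n = buf.getLong();
        byte[] json = new byte[(int) n];
        buf.get(json);
        return new String(json, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Build a minimal in-memory file with one 2x2 fp32 tensor "w".
        String header = "{\"w\":{\"dtype\":\"F32\",\"shape\":[2,2],\"data_offsets\":[0,16]}}";
        byte[] h = header.getBytes(StandardCharsets.UTF_8);
        ByteBuffer file = ByteBuffer.allocate(8 + h.length + 16).order(ByteOrder.LITTLE_ENDIAN);
        file.putLong(h.length).put(h).put(new byte[16]); // header, then raw tensor data
        System.out.println(readHeader(file.array()).contains("\"dtype\":\"F32\"")); // true
    }
}
```

A real importer would additionally parse the JSON and memory-map the tensor region, but the on-disk contract is just this header-plus-blob layout.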
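The DSP precomputation mentioned above is internal to the framework, but the general technique, folding constant subgraphs ahead of execution the way onnx graph optimizers do, can be sketched with toy node types (hypothetical, not the framework’s API):

```java
import java.util.List;

public class ConstantFold {
    // A toy expression node: a constant, a named input, or an add over children.
    record Node(String op, double value, List<Node> kids) {
        static Node c(double v) { return new Node("const", v, List.of()); }
        static Node in(String name) { return new Node(name, 0, List.of()); }
        static Node add(Node a, Node b) { return new Node("add", 0, List.of(a, b)); }
    }

    // Recursively replace any add whose children are both constants with a
    // single precomputed constant node, so no work remains at run time.
    static Node fold(Node n) {
        if (!n.op().equals("add")) return n;
        Node a = fold(n.kids().get(0)), b = fold(n.kids().get(1));
        if (a.op().equals("const") && b.op().equals("const"))
            return Node.c(a.value() + b.value());
        return Node.add(a, b);
    }

    public static void main(String[] args) {
        // (2 + 3) + x folds to 5 + x: the constant half is evaluated once, up front.
        Node g = Node.add(Node.add(Node.c(2), Node.c(3)), Node.in("x"));
        Node folded = fold(g);
        System.out.println(folded.kids().get(0).value()); // 5.0
    }
}
```

A compiler stack then only has to generate kernels for the residual, input-dependent part of the graph.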
Now as for what’s prevented normal releases in the meantime:
Besides the product differentiation question and the commercial efforts, honestly: build times. Github actions has a 6-hour window, and doing a release was an event every time due to long compile times. I’ve just now gotten proper ccache-based infra that allows for reasonable c++ compile times, letting me ship on different platforms. That took a lot of work. Compiling for 15 or so architecture/operating-system combinations was painful, and cuda especially usually hit that 6-hour window. I was dealing with a lot of timeouts and, to be blunt, don’t have the budget or time for anything but a github actions based setup. Years ago we had a custom jenkins, but managing that infra, not to mention all the different platforms, was just painful. GH actions has gotten a lot better and allowed me to actually build for arm macs, android, as well as linux-arm directly without cross compilation. That solved a lot of headaches.
Hopefully that makes sense. I’m hoping to wrap that up in the next few months. Performance and test coverage are about at a place I’m happy with. I will still need to do follow-up work on documentation, plus examples for the new features.
With some of the bets I’ve made on MLIR, I’m seeing something I’m at least somewhat happy with. Thanks for being patient with me.