How can I turn a network into stub codes?

I have trained a multi-layer network to recognize some specific soundwaves. It worked perfectly well, and took only ~1 ms to run. However, when I tried to migrate it to Android, I found to my astonishment that the APK is over 1 GB. I checked my model and found that it was merely 50 KB. Is there a way I can turn this model into, for instance, Java code that works without DL4J? I don't care about the performance loss, since each soundwave is 10 seconds long and I have 10 seconds to spare. But a 1 GB APK is just not appealing!

After getting a reply via GitHub and trying deeplearning4j-nn, the size of the APK is now 154 MB, but it is still too large for me. The ideal size for me is around 10 MB or smaller, for it is just a camera app that lets users take pictures by saying "Cheers!".
Maybe including the library is just not an option for me.

@huzpsb are you willing to give reducing the size of the ops a try? You should be able to reduce the binary size if you're willing to do a bit of compilation work. If you can post your pom, we can take a closer look as well.


<?xml version="1.0" encoding="UTF-8"?>
<!--
  ~ This program and the accompanying materials are made available under the
  ~ terms of the Apache License, Version 2.0 which is available at
  ~ https://www.apache.org/licenses/LICENSE-2.0.
  ~
  ~ See the NOTICE file distributed with this work for additional
  ~ information regarding copyright ownership.
  ~
  ~ Unless required by applicable law or agreed to in writing, software
  ~ distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
  ~ WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
  ~ License for the specific language governing permissions and limitations
  ~ under the License.
  ~
  ~ SPDX-License-Identifier: Apache-2.0
  -->

<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">

    <!-- Group ID, artifact ID and version of the project. You can modify these as you want -->

    <!-- Properties section. Change ND4J versions here, if required -->

    <!-- deeplearning4j-core: contains main functionality and neural networks -->

    <!--
        ND4J backend: every project needs one of these. The backend defines the hardware on which network training
        will occur. "nd4j-native-platform" is for CPUs only (for running on all operating systems).
    -->

    <!-- CUDA: to use GPU for training (CUDA) instead of CPU, uncomment this, and remove nd4j-native-platform -->
    <!-- Requires CUDA to be installed to use. Change the version (8.0, 9.0, 9.1) to change the CUDA version -->

    <!-- Optional, but recommended: if you use CUDA, also use CuDNN. To use this, CuDNN must also be installed -->
    <!-- See: -->

    <!-- Maven compiler plugin: compile for Java 8 -->

    <!--
        Maven shade plugin configuration: this is required so that if you build a single JAR file (an "uber-jar"),
        it will contain all the required native libraries, and the backends will work correctly.
        Used, for example, when running the following commands:

        mvn package
        cd target
        java -cp deeplearning4j-examples-1.0.0-beta-bin.jar org.deeplearning4j.LenetMnistExample
    -->
    <!--suppress MavenModelInspection -->

</project>

After making the JavaCPP platform arch-only, I've reached 90 MB now. Still, it isn't ideal for me :confused:

@huzpsb yeah, of course; that's why I keep alluding to reducing the size of the binary. You can also reduce the size of the native library inside by compiling only the ops you need. I would need to take a look at your network, but I might be able to help you figure out how to use that feature.

Essentially the steps would be as follows:

  1. Compile the C++ library with only the ops you want, using something like:
mvn -Dlibnd4j.operations="add,subtract,..."

If you look in there, the native binary with all ops is around 125 MB. That should help quite a bit once we figure out what you're using. That feature was just introduced recently, and most folks don't usually want to compile from source.

That is how it's usually done in a lot of mobile apps, though: basically a C header file plus a small native shim. This will essentially be a custom "backend", which is really just the native code compiled a certain way plus the native shim.

The model:

About compiling the source:
Well, I suppose I am one of those folks :\

More or less: consider adding a pretrained-model-to-Java-bytecode compiler to DL4J?
I've seen people converting models into stub code, though that was PyTorch into LLVM.
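For a dense-only network like the one in this thread (the debug log later in the conversation shows only matmul, add, and softmax ops), the "stub code" idea could look like the hand-rolled forward pass below. This is only a sketch: the class name, shapes, and values are toy placeholders, and real weights would have to be exported from the trained DL4J model.

```java
// Plain-Java inference stub for a dense-only network:
// each layer is a matmul plus a bias add, with a softmax at the end,
// mirroring the op sequence matmul -> add -> ... -> softmax.
public class DenseStub {

    // out[j] = b[j] + sum_i in[i] * w[i][j]  (matmul + add)
    static double[] dense(double[] in, double[][] w, double[] b) {
        double[] out = new double[b.length];
        for (int j = 0; j < b.length; j++) {
            double sum = b[j];
            for (int i = 0; i < in.length; i++) {
                sum += in[i] * w[i][j];
            }
            out[j] = sum;
        }
        return out;
    }

    // Numerically stable softmax (subtract the max before exponentiating)
    static double[] softmax(double[] in) {
        double max = Double.NEGATIVE_INFINITY;
        for (double v : in) max = Math.max(max, v);
        double sum = 0;
        double[] out = new double[in.length];
        for (int i = 0; i < in.length; i++) {
            out[i] = Math.exp(in[i] - max);
            sum += out[i];
        }
        for (int i = 0; i < out.length; i++) out[i] /= sum;
        return out;
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.0};                              // toy input
        double[][] w = {{0.5, -0.5, 0.1}, {0.2, 0.3, -0.1}};  // toy 2x3 weights
        double[] b = {0.0, 0.1, -0.1};                        // toy biases
        System.out.println(java.util.Arrays.toString(softmax(dense(x, w, b))));
    }
}
```

A chain of such `dense` calls, one per exported layer, reproduces a multi-layer forward pass with no library dependency at all; the trade-off, as discussed below, is losing the optimized native math routines.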

@huzpsb yeah, we sort of do. That would take the form of a binary you would load. I'm interested in seeing how you would expect this to work, though. Anything that would give a good user experience, we are open to.

Essentially, our off-the-shelf binaries bundle armcompute (faster math for ARM processors) for various ops and are compiled with Android's LLVM. What we would want to do is run that same process but include only the ops necessary for your app.

The form that would take is a custom nd4j backend that you would include in your project.
A “backend” is the code I mentioned earlier.

Java bytecode wouldn't really allow us to benefit from the faster math routines, which are unfortunately important for ML performance. That's why we don't bother with pure Java. We had that years back, but it never could really keep up with LLVM and the like. Let me take a look at your model to see what I can do in the meantime.

Sorry it’s not the ideal UX and I appreciate your input here.

May I know the Git version? I will use the legacy version if that's the only way.

Well, I fully understand the importance of performance in ML. But please, do consider the size. Truly, in many cases, performance is king. But there are always exceptions. In my case, I only need to recognize the sound within 5 seconds (I cut the sound into 5-second clips and process each one during the next 5 seconds, while recording the new clip). It makes no difference to shave that down to 0.01 s (which is what it actually takes). But it really makes a difference whether my app is 15 MB or 7 MB. It also makes a difference whether it can run on only 90% of devices (by trimming the natives) or ALL devices (by compiling it into Java).
Yes, it isn't the usual case. But that's what I am facing. I'd be grateful if you can understand.

If you insist that this compiler could be misleading in some ways, please just take it back and mark it "deprecated".
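The recording/processing overlap described above (classify one 5-second clip while the next one records) can be sketched as a simple producer/consumer hand-off. The recorder and classifier below are stand-ins with no real audio I/O, and the 16 kHz sample rate is an assumption; only the hand-off pattern is shown.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch: a recorder thread produces fixed-length clips while the main
// thread classifies the previous clip, so recording and inference overlap.
public class ClipPipeline {

    static final int SAMPLES_PER_CLIP = 5 * 16000; // 5 s at an assumed 16 kHz rate

    static int runPipeline(int nClips) {
        BlockingQueue<float[]> clips = new ArrayBlockingQueue<>(1);

        Thread recorder = new Thread(() -> {
            try {
                for (int i = 0; i < nClips; i++) {
                    clips.put(new float[SAMPLES_PER_CLIP]); // "recorded" clip ready
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        recorder.start();

        int classified = 0;
        try {
            for (int i = 0; i < nClips; i++) {
                clips.take();  // blocks until the recorder hands over a clip
                classified++;  // classification would run here (~0.01 s per clip)
            }
            recorder.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return classified;
    }

    public static void main(String[] args) {
        System.out.println("clips classified: " + runPipeline(3)); // prints "clips classified: 3"
    }
}
```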

@huzpsb sorry, we're not supporting older versions like that. We got rid of that older code for a reason: it was slow. In order to be competitive, the compiler approach like you've discussed is the only viable method.

You’re on your own if you go that route and we can just end the conversation.

The older version doesn't really work well on Android either.
The older version is less performant, and doesn't have the support or API compatibility that would make this useful.

Would you be willing to work with me a bit? I keep describing a solution that might work for you but you keep wanting to just find a quick workaround.

I don't blame you for just wanting the quick workaround, but we've been using this internally for quite a while, it's working well, and it would net you good performance on top of that.

That quick workaround will be unsupported, less performant and not easy to update or maintain.

Well, in that case, I'll work on both methods then.
I've uploaded my pre-trained raw model here:

As you can see, it consists of nothing but dense layers.
What should I do, and what environment do I need to install?

(C++ is beyond me)

@huzpsb great, appreciated. Dense layers shouldn't be hard. Let me do a quick POC for you.

Looking at this first and just running it. I first did this:

import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

import java.io.File;

public class MLPOpExecution {

    public static void main(String... args) throws Exception {
        // Debug/verbose mode makes the executioner print each op as it runs
        Nd4j.getExecutioner().enableDebugMode(true);
        Nd4j.getExecutioner().enableVerboseMode(true);
        MultiLayerNetwork multiLayerNetwork = MultiLayerNetwork.load(new File(""), true);
        INDArray rand = Nd4j.rand(1, 22);
        INDArray output = multiLayerNetwork.output(rand);
        System.out.println(output);
    }
}

This got me with verbose/debug mode enabled:

Executing op: [matmul]
About to get variable in  execute output
node_1:0 result shape: [1, 30]; dtype: FLOAT; first values [0.301271, 0.0868343, -0.256554, 7.70856e-39, 0.438268, 0.0257094, -0.084022, 0.444107, -0.379782, 1.11918, -1.41936e-38, 0.617232, 0.200801, 0.589107, 0.287842, 0.503616, 8.49135e-39, 0.311303, 0.866285, -0.190678, -0.125231, -0.385527, -0.679998, -0.228666, 0.481709, 2.11147e-38, 0.764672, -0.839745, 0.0416826, -0.221541]
Executing op: [add]
About to get variable in  execute output
node_1:0 result shape: [1, 30]; dtype: FLOAT; first values [0.38315, 0.0807286, -0.296927, -0.00299783, 0.454721, 0.0242642, -0.0802173, 0.503197, -0.407838, 1.1786, -1.60034e-18, 0.669206, 0.229167, 0.623709, 0.316148, 0.511799, -0.0030775, 0.434601, 0.880709, -0.227417, -0.115579, -0.390552, -0.6831, -0.172079, 0.545913, -0.00552176, 0.859666, -0.883162, 0.0380682, -0.167185]
Executing op: [matmul]
About to get variable in  execute output
node_1:0 result shape: [1, 30]; dtype: FLOAT; first values [0.265153, 0.244696, 0.161446, 0.157242, -0.187973, -0.0313901, -0.388538, -0.040024, -0.226831, -2.85167e-38, -0.0510315, 0.35548, 0.372838, -0.158745, 0.314486, 0.138599, 0.455396, 0.269985, 0.477674, 0.407454, -0.257615, 0.574969, 0.456761, 0.208015, -0.108402, 0.655384, 0.406791, 0.118828, -0.386967, 0.436312]
Executing op: [add]
About to get variable in  execute output
node_1:0 result shape: [1, 30]; dtype: FLOAT; first values [0.225367, 0.312929, 0.167147, 0.117386, -0.147852, -0.0765696, -0.438198, -0.0206834, -0.162802, -0.00561494, -0.113384, 0.470834, 0.449319, -0.118911, 0.31141, 0.155436, 0.452145, 0.362151, 0.610688, 0.501175, -0.223319, 0.569607, 0.412892, 0.189691, -0.154787, 0.599811, 0.43007, 0.109047, -0.293645, 0.51485]
Executing op: [matmul]
About to get variable in  execute output
node_1:0 result shape: [1, 30]; dtype: FLOAT; first values [-0.0996243, 0.208537, 0.261221, 0.0982679, 0.0328799, 0.298155, -7.56417e-40, 0.0997858, 0.0696631, 0.101161, 0.230413, 0.341019, 0.0622007, 0.103668, 0.375, 0.0267858, 5.19955e-39, 0.125258, 1.09221e-38, 0.0244133, 0.103144, 0.267562, -0.0267185, 0.0209421, 0.0765266, 0.500674, 0.250974, 0.15208, 0.211295, 1.58391e-38]
Executing op: [add]
About to get variable in  execute output
node_1:0 result shape: [1, 30]; dtype: FLOAT; first values [0.0261627, 0.160388, 0.266522, 0.113639, 0.0757348, 0.290564, -0.000198537, 0.161714, 0.153787, 0.0441523, 0.189115, 0.47859, 0.19817, 0.144805, 0.350771, 0.000570966, -0.0016779, 0.168824, -0.114423, 0.00561272, 0.218659, 0.290733, -0.0846675, 0.131817, -0.0589332, 0.531538, 0.243041, 0.246128, 0.250609, -3.67472e-20]
Executing op: [matmul]
About to get variable in  execute output
node_1:0 result shape: [1, 3]; dtype: FLOAT; first values [-0.102108, -0.53706, 0.433646]
Executing op: [add]
About to get variable in  execute output
node_1:0 result shape: [1, 3]; dtype: FLOAT; first values [-0.158008, -0.450346, 0.386856]
Executing op: [softmax]
About to get variable in  execute output
node_1:0 result shape: [1, 3]; dtype: FLOAT; first values [0.28811, 0.215079, 0.49681]

So the main ops are matmul, softmax, and add.

I’ll update this post with some flags for you in a bit.

Thank you for all you've done for me. But honestly speaking, I can't really understand this.
I'll just wait for your update about what I can do.

@huzpsb I would recommend trying to. Asking on Stack Overflow isn't going to help you. You'll get the same person helping you :slight_smile:

There’s not really much to understand. Neural networks have math operations they execute. Add, subtract, multiply etc.

In order to reduce the size of the binary, the "compilation" involves including only what you need and nothing more. The "inclusion" is basically code that executes only the math operations you want, and nothing else. Excluding code you don't need from your app reduces its size.

We have that feature built into the build. What I posted was how to find out which ops are being executed in your neural net. Debug and verbose mode just force DL4J to print what it's doing.

All you have to do is look at that output to see the ops. I did that for you already.

The proof of concept I'm throwing together for you will demonstrate how to do that. As long as you're willing to include the result of that in your project, you'll have an APK of reduced size.

Please try to be patient when learning new things. You’re already more or less getting the work done for you.

When you want to do things like optimize a library's size, you can't just ignore the details and expect magic results. You should try to spend some time understanding the fundamentals, especially if they're being explained to you. It also lets you follow a formula (that you yourself understand) if you need to modify your neural network later (say, add a conv layer or something).

Well, I'll try.
Just to confirm: what I need to do is compile DL4J with

mvn -B -V -U clean install -pl  -Dlibnd4j.platform=linux-x86_64 -Djavacpp.platform=linux-x86_64 -Dmaven.test.skip=true -Dlibnd4j.compute=matmul, softmax, add

and use the newly compiled one, right?

@huzpsb So running through the POC, I got it down to around 38 MB. Not quite what you were looking for. If you're fine with that, I can post how I did it. Otherwise I won't have you go through the hassle.

In short it involves a custom compilation this way:

 mvn -Dlibnd4j.operations="add;matmul;softmax" clean  package -DskipTests

This does what I mentioned and only includes code relevant to those ops, which were determined by the debug output above. I ran your code with a simple load and called .output(…) on the network.
Do you need to do training, or just inference?

Next I can try to also thin down the data types and see what that gets me. That ran on my laptop and stripped it down from what's normally around 125 MB.


Narrowed it down to 16MB with:

mvn -Dlibnd4j.operations="add;matmul;softmax"  -Dlibnd4j.datatypes="float;int" clean package -DskipTests

This restricts the used data types.

@huzpsb following up on your previous post: after you do what I did with the C++ code base, you just need to compile nd4j-native against it.

mvn -Pcpu  -rf :nd4j clean install -DskipTests

Note that if you compile for Android, though, you'll either need to compile on an ARM device or cross-compile. If you decide to try it, I can help you there as well.

Thank you, but 16 MB isn't fine for me.
Well, if there is nothing more I can do, maybe I'll have to work on the other method and end this trail.
But I am still grateful for all this.

Though I think you may still want to post the method here. I believe it would be really helpful for anybody who reads this someday.

@huzpsb anything in particular driving the 10MB hard requirement?
Also, is the 10MB compressed or uncompressed? Either way your use case is still a good data point for me to start from.

You can at least export the weights from your network in the meantime; I can try to help you build something around those. I can also see what else I can do to get the size down.

Almost all native libs take up some space.
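If the weights were exported to plain files (say, one text file per layer), the app could later load them without any native library at all. A minimal sketch, assuming a simple whitespace-separated text format; the format, file names, and class name are illustrative, not something DL4J emits by default:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Sketch: load dense-layer weights that were exported to plain text,
// one matrix row per line, values whitespace-separated (assumed format).
public class WeightLoader {

    // Parse lines of whitespace-separated numbers into a matrix
    static double[][] parseMatrix(List<String> lines) {
        double[][] m = new double[lines.size()][];
        for (int r = 0; r < lines.size(); r++) {
            String[] parts = lines.get(r).trim().split("\\s+");
            m[r] = new double[parts.length];
            for (int c = 0; c < parts.length; c++) {
                m[r][c] = Double.parseDouble(parts[c]);
            }
        }
        return m;
    }

    static double[][] loadMatrix(Path file) throws IOException {
        return parseMatrix(Files.readAllLines(file));
    }

    public static void main(String[] args) throws IOException {
        // Write a toy 2x3 weight matrix to a temp file, then read it back
        Path tmp = Files.createTempFile("layer0_W", ".txt");
        Files.write(tmp, List.of("0.5 -0.5 0.1", "0.2 0.3 -0.1"));
        double[][] w = loadMatrix(tmp);
        System.out.println(w.length + " x " + w[0].length); // prints "2 x 3"
        Files.deleteIfExists(tmp);
    }
}
```

Paired with a hand-rolled matmul/add/softmax forward pass, this would make the model file itself the only payload beyond ordinary Java bytecode.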

I only have 10 MB to work with when uploading via coolDroid (an Android app-sharing platform in China; they limit the size of published apps until they gain a certain number of hits, to save their bandwidth, I guess).
I don't want to cut out the features I already have.

And if 16 MB is the uncompressed size (before jarring it up, or even running ProGuard), it couldn't be more helpful!

@huzpsb thanks a lot! Totally fair! Keep your expectations low, but let me see what else I can do for you in the meantime. We have a few more compilation flags, off by default, that I'm checking to see if they can reduce the size a bit further. Either way, this resulted in some good changes for end users on mobile, so I appreciate you at least hearing me out!

@huzpsb with LTO turned on, I got it down to 3.7 MB. This also added a new flag. Would you still be up for trying it now?

Yes of course!!! <33333333333