Issues about modifying the source code

@fubuki thanks for understanding and please do check in! Sometimes issues are easier than I’m making them out to be but when I sit down to look at issues like this where I have to deep dive I try to allocate at least an hour.

@agibsonccc I will also try to fix this issue myself. Thanks!

@fubuki Sorry for taking so long.
Are you trying to emulate a module like structure?
I also looked through your code.

Looking at this a bit, SameDiff allows you to define subfunctions if you are just trying to modularize. From what I’m seeing this should just be a straightforward chain. There shouldn’t be any need to declare them separately. Could you describe a bit about how you want to use this? Pseudocode is fine. What I’m imagining is:

INDArray baseInput = ...;

SameDiff one = SameDiff.create();
//declare base input as variable
one.placeholder("input",...);
one.placeholder("label",...);
//declare functions ...
one.setTrainingConfig(..);
one.fit(..);
Map<String,INDArray> inputs = ...;
Map<String,INDArray> outputsOfOne = one.output(inputs, "output");
SameDiff two = SameDiff.create();
//declare functions...
two.setTrainingConfig(..);
two.fit(outputsOfOne,someLabelsForTwo);



You could chain something like this as well if you’re trying to achieve some sort of teacher/student model. It takes a bit of setting up but I think you get the idea. Basically: create each instance, set the training config, call fit on one, get the output of one, pass it to two, and so on.

That should avoid the need to use externalGradients at all. If you do need them, you can still use the approach we used with ExternalGradients.

Thank you for reviewing my code! What I eventually want to do is split a model into several sub-models, put these sub-models on different devices, and train the model in a distributed manner, as the following figure shows.


So the gradient with respect to the input of each sub-model should be passed to the previous sub-model during the backward pass.

Here is some pseudo code:

// one, two and three are the sub-models
// forward
output1 = one.forward(inputs)
output2 = two.forward(output1)
loss = three.forward(output2)

// backward and update the weight on each sub-model
three.backward()
two.backward(output2.grad)
one.backward(output1.grad)

But in your code, it seems that each sub-model is trained independently; that is, the gradient of each sub-model cannot be passed to its previous sub-model.
Thank you!
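
The hand-off described in the pseudocode above can be sketched in plain Java without SameDiff. This is only an illustration under simplifying assumptions: each sub-model is a single scalar layer y = w * x, the loss is a squared error, and the names LinearStage, PipelineSketch and trainStep are hypothetical, not part of any DL4J API. The point it shows is that each stage's backward call consumes the gradient of its output and returns the gradient of its input for the previous stage.

```java
/** Hypothetical scalar sub-model y = w * x, used only to illustrate the gradient hand-off. */
final class LinearStage {
    double w;                  // trainable weight
    private double lastInput;  // input cached during the forward pass

    LinearStage(double w) { this.w = w; }

    double forward(double x) {
        lastInput = x;
        return w * x;
    }

    /** Takes dL/dy, applies one SGD update to w, and returns dL/dx for the previous stage. */
    double backward(double gradOutput, double learningRate) {
        double gradInput = gradOutput * w;          // computed with the weight used in forward
        double gradWeight = gradOutput * lastInput;
        w -= learningRate * gradWeight;
        return gradInput;
    }
}

final class PipelineSketch {
    /** One training step over three chained stages; returns the loss before the update. */
    static double trainStep(LinearStage one, LinearStage two, LinearStage three,
                            double x, double target, double learningRate) {
        // forward: each stage consumes the previous stage's output
        double out1 = one.forward(x);
        double out2 = two.forward(out1);
        double pred = three.forward(out2);

        double diff = pred - target;
        double loss = 0.5 * diff * diff;            // dLoss/dPred = diff

        // backward: each stage passes the gradient of its input to the stage before it
        double gradOut2 = three.backward(diff, learningRate);
        double gradOut1 = two.backward(gradOut2, learningRate);
        one.backward(gradOut1, learningRate);
        return loss;
    }

    public static void main(String[] args) {
        LinearStage one = new LinearStage(1.0);
        LinearStage two = new LinearStage(1.0);
        LinearStage three = new LinearStage(1.0);
        for (int step = 0; step < 5; step++) {
            System.out.println("loss = " + trainStep(one, two, three, 1.0, 2.0, 0.1));
        }
    }
}
```

In a distributed setup, only the values crossing a stage boundary (out1, out2 on the way forward, gradOut2, gradOut1 on the way back) would need to be sent between devices.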

@fubuki in that case some form of ExternalErrors, via samediff.externalErrors, is what you’d want in order to make that work alongside what I showed you here.

Originally what I was thinking was manually passing in the outputs of each model as a pipeline.

Let me think on your pseudocode a bit and try to come up with an example. I think what you’re asking for here seems doable; it just needs a bit of wiring along the lines of this:
https://github.com/eclipse/deeplearning4j/blob/12b8ff3514f3c996b7b7aa8cd7af3b1905da4102/nd4j/nd4j-backends/nd4j-tests/src/test/java/org/nd4j/autodiff/samediff/SameDiffTests.java#L2349-L2348

I will take a look at the GitHub link you provided.
As I mentioned last week, when I tried to use externalErrors to pass the gradients, an error occurred. If you want to reproduce the error, you can run subModelTrainTest() in MNISTCNNTest.java. The error occurs on line 118. I have no idea how to fix it. Thanks!

@agibsonccc Hi, I tried to run the Android example on a mobile phone, but on a physical device it reports an error saying “More than one file was found with OS independent path ‘META-INF/native-image/linux-ppc64le/jnijavacpp/reflect-config.json’”.
The Android example is the project found here: deeplearning4j-examples/android-examples at master · eclipse/deeplearning4j-examples · GitHub

I also found someone with the same error on the forum (link: Troubles with launching DL4J on Android - #6 by UnDan).
But when I add “exclude ‘META-INF/native-image/**’”, a new error is reported: “2 files found with path ‘nd4j-native.properties’.”
I tested on Windows 11 and macOS; both had this error.

Thank you!

@fubuki ensure you only include the classifiers you need. The -platform classifiers include all dependencies and are a bit troublesome with Gradle.

As for nd4j-native.properties, that isn’t under META-INF/native-image/** so I have a feeling something else is going on. Could I look at your build.gradle?

@agibsonccc Here is my build.gradle file. FTPipeHD_Android/build.gradle at main · fubukishiro/FTPipeHD_Android · GitHub

After adding ‘pickFirst ‘nd4j-native.properties’’, the problem seems to be solved, but I am not sure whether this is the correct way to fix it.

Another issue is that when I ran my app on a physical device, the MnistDataSetIterator in DL4J failed to ‘mkdir .deeplearning4j/data/MNIST’. I have already given the app write permission. Does that mean the MnistDataSetIterator is not available on Android?

Thank you!

@fubuki I’ve found this to work in the emulator:

packagingOptions {
    exclude 'META-INF/native-image/**/**.json'
    pickFirst 'nd4j-native.properties'
}

These are just related to GraalVM and don’t really need to be included in the binary anyway. This is a fairly common problem you’ll bump into with any library, not just DL4J, so it’s more about knowing the basics of Android development. I’ll update the example in the next release. Thanks for flagging.

@agibsonccc Your solution works! Thank you very much!

But there is another issue. When I ran my app on a physical device, the MnistDataSetIterator in DL4J reported ‘Could not mkdir .deeplearning4j/data/MNIST’. I have already given the app write permission. Does that mean the MnistDataSetIterator is not available on Android? Or is this a problem with the physical device?

@fubuki you likely need to set the app permissions in the emulator. Almost all apps require permissions to be granted, usually by prompting the user. It’s not as sweeping as “it’s not supported”; it’s more “the MNIST dataset iterator writes some data to disk and apps require permission to do that”.
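
One way to narrow this down is to check, from inside the app, whether the directory from the error message can be created at all under a given base directory. The sketch below is a plain-Java diagnostic and not part of DL4J; the class name DataDirCheck is hypothetical, and it assumes the relative path .deeplearning4j/data/MNIST from the error is resolved against some base directory such as the user.home system property.

```java
import java.io.File;

/** Hypothetical diagnostic helper, not part of DL4J. */
final class DataDirCheck {
    /**
     * Returns true if baseDir/.deeplearning4j/data/MNIST already exists as a
     * directory, or can be created, and is writable by the current process.
     */
    static boolean canCreateDataDir(File baseDir) {
        File target = new File(baseDir, ".deeplearning4j/data/MNIST");
        boolean present = target.isDirectory() || target.mkdirs();
        return present && target.canWrite();
    }
}
```

Calling DataDirCheck.canCreateDataDir(new File(System.getProperty("user.home", "."))) and logging the result shows whether the default location is usable on the device; calling it with an app-private directory such as context.getFilesDir() shows whether the problem is the location rather than the permission.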

I will try it again! Thank you for your timely help!

@agibsonccc Hi, when I call the sd.calculateGradients function on Android, the program hangs without any error.

I tried to use debug mode to investigate, but the debugger got stuck at the following line in SameDiff.java:

OutAndGrad oag = calculateGradientsAndOutputs(placeholderVals, null, variables);

Have you tried using this function in an Android app?

While setting up the model, it prints many error logs like the following:

2021-12-20 18:30:58.137 7638-7961/com.example.androidDl4jClassifier E/linker: library "/vendor/lib64/hw/gralloc.sdm660.so" ("/vendor/lib64/hw/gralloc.sdm660.so") needed or dlopened by "/data/app/com.example.androidDl4jClassifier-g--tRchmKEZKtMz3VnDa_Q==/lib/arm64/libjnijavacpp.so" is not accessible for the namespace: [name="classloader-namespace", ld_library_paths="", default_library_paths="/data/app/com.example.androidDl4jClassifier-g--tRchmKEZKtMz3VnDa_Q==/lib/arm64:/data/app/com.example.androidDl4jClassifier-g--tRchmKEZKtMz3VnDa_Q==/base.apk!/lib/arm64-v8a", permitted_paths="/data:/mnt/expand:/data/data/com.example.androidDl4jClassifier"]
2021-12-20 18:30:58.137 7638-7961/com.example.androidDl4jClassifier E/linker: library "/vendor/lib64/libqdMetaData.so" ("/vendor/lib64/libqdMetaData.so") needed or dlopened by "/data/app/com.example.androidDl4jClassifier-g--tRchmKEZKtMz3VnDa_Q==/lib/arm64/libjnijavacpp.so" is not accessible for the namespace: [name="classloader-namespace", ld_library_paths="", default_library_paths="/data/app/com.example.androidDl4jClassifier-g--tRchmKEZKtMz3VnDa_Q==/lib/arm64:/data/app/com.example.androidDl4jClassifier-g--tRchmKEZKtMz3VnDa_Q==/base.apk!/lib/arm64-v8a", permitted_paths="/data:/mnt/expand:/data/data/com.example.androidDl4jClassifier"]
2021-12-20 18:30:58.138 7638-7961/com.example.androidDl4jClassifier E/linker: library "/vendor/lib64/libgrallocutils.so" ("/vendor/lib64/libgrallocutils.so") needed or dlopened by "/data/app/com.example.androidDl4jClassifier-g--tRchmKEZKtMz3VnDa_Q==/lib/arm64/libjnijavacpp.so" is not accessible for the namespace: [name="classloader-namespace", ld_library_paths="", default_library_paths="/data/app/com.example.androidDl4jClassifier-g--tRchmKEZKtMz3VnDa_Q==/lib/arm64:/data/app/com.example.androidDl4jClassifier-g--tRchmKEZKtMz3VnDa_Q==/base.apk!/lib/arm64-v8a", permitted_paths="/data:/mnt/expand:/data/data/com.example.androidDl4jClassifier"]
2021-12-20 18:30:58.138 7638-7961/com.example.androidDl4jClassifier E/linker: library "/system/lib64/vndk-sp-29/android.hardware.graphics.common@1.1.so" ("/system/lib64/vndk-sp-29/android.hardware.graphics.common@1.1.so") needed or dlopened by "/data/app/com.example.androidDl4jClassifier-g--tRchmKEZKtMz3VnDa_Q==/lib/arm64/libjnijavacpp.so" is not accessible for the namespace: [name="classloader-namespace", ld_library_paths="", default_library_paths="/data/app/com.example.androidDl4jClassifier-g--tRchmKEZKtMz3VnDa_Q==/lib/arm64:/data/app/com.example.androidDl4jClassifier-g--tRchmKEZKtMz3VnDa_Q==/base.apk!/lib/arm64-v8a", permitted_paths="/data:/mnt/expand:/data/data/com.example.androidDl4jClassifier"]

I am not sure whether this is the reason why the calculateGradients function gets stuck.

Thanks!

@fubuki that looks like more permissions issues. Could you ensure that the app has permission to load the necessary native libraries? You can see that it says things like:
"/vendor/lib64/hw/gralloc.sdm660.so" needed by namespace dl4j etc…
I’m not sure which permission it is but it’s definitely another issue like that. You can see a similar problem here:

Could you please add this configuration to see if this fixes your issue?

Thank you for your link, but the second link you provided targets Android 11, while I am running my app on Android 10.

Also, there are too many *.so files that are not accessible. I have no idea how to grant permission for so many *.so files. Are you testing the DL4J API with root privileges?

@agibsonccc With the privilege issue still unsolved, I found that the sd.calculateGradients call resumes about 20 minutes after getting stuck. After that the program no longer gets stuck at this function. It seems to take a long time only the first time it is called. Is this normal?
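
To put numbers on the "slow only on the first call" observation, the first call can be timed separately from later calls. This is a generic timing sketch, not DL4J code; the class name FirstCallTimer and its measure method are hypothetical.

```java
/** Hypothetical helper that separates first-call cost from steady-state cost. */
final class FirstCallTimer {
    /**
     * Runs the task once, then `repeats` more times.
     * Returns {firstCallNanos, averageLaterCallNanos}; the average is 0 when repeats is 0.
     */
    static long[] measure(Runnable task, int repeats) {
        long start = System.nanoTime();
        task.run();
        long first = System.nanoTime() - start;

        long total = 0;
        for (int i = 0; i < repeats; i++) {
            long t = System.nanoTime();
            task.run();
            total += System.nanoTime() - t;
        }
        long average = repeats > 0 ? total / repeats : 0;
        return new long[] {first, average};
    }
}
```

Wrapping the gradient call in the task, for example FirstCallTimer.measure(() -> sd.calculateGradients(placeholderVals, variables), 5) with the same arguments the app already passes, gives two figures that can be compared between the emulator and the physical device.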

@fubuki what kind of phone/tablet are you running on? I have 3 different android devices I could test on to see if you have something for me.

Beyond that, I feel like this permissions issue should be a standard thing with app permissions. I looked at your Android manifest and saw you have the standard permissions. Somehow I don’t think you have to provide a list of files manually. I feel like it would be a wildcard or some sort of sweeping permission you could add. I took a cursory look at it; it’s hard to see what the issue could be though.

Do you know if I could test what you’re seeing in the android studio emulator so I could try something?

I don’t want to extrapolate too much on runtime performance till we solve this basic issue first.

@fubuki more broadly before I forget we do have folks running successfully on android (search android on the forums). Generally the problems I hear about are related to binary size or neural network speed. We recently integrated armcompute libraries to help with that.
I don’t think I’ve seen this permissions problem you’re having from any other user. That’s why I’d like to try to understand a bit about your environment.

@agibsonccc Hi there. Sorry for replying so late. I am running on a Samsung A9 Star. I also tried to run my app on the Android Studio emulator, and it has the same privilege issue.

Here is the config of my emulator copied from the AVD manager.

Name: Pixel_5_API_29
CPU/ABI: Google APIs Intel Atom (x86)
Target: google_apis [Google APIs] (API level 29)
Skin: pixel_4
SD Card: 512M
fastboot.chosenSnapshotFile: 
runtime.network.speed: full
hw.accelerometer: yes
hw.device.name: pixel_5
hw.initialOrientation: Portrait
image.androidVersion.api: 29
tag.id: google_apis
hw.mainKeys: no
hw.camera.front: emulated
avd.ini.displayname: Pixel 5 API 29
hw.gpu.mode: auto
hw.ramSize: 4096
PlayStore.enabled: false
fastboot.forceColdBoot: no
hw.cpu.ncore: 4
hw.keyboard: yes
hw.sensors.proximity: yes
hw.dPad: no
hw.lcd.height: 2340
vm.heapSize: 1024
skin.dynamic: yes
hw.device.manufacturer: Google
hw.gps: yes
hw.audioInput: yes
image.sysdir.1: system-images\android-29\google_apis\x86\
showDeviceFrame: yes
hw.camera.back: virtualscene
AvdId: Pixel_5_API_29
hw.lcd.density: 440
hw.arc: false
fastboot.forceChosenSnapshotBoot: no
fastboot.forceFastBoot: yes
hw.trackBall: no
hw.sdCard: yes
tag.display: Google APIs
runtime.network.latency: none
disk.dataPartition.size: 4G
hw.sensors.orientation: yes
hw.gpu.enabled: yes

And here is the GitHub link of my app. GitHub - fubukishiro/FTPipeHD_Android

You can use the emulator in Android Studio directly to reproduce the issue; the error messages are printed in Logcat.

Sincerely thank you for your help!