Keras Functional Model on Spark using DL4J

Hi there, I am hitting a roadblock while trying to use DL4J + ND4J to deploy a Keras Functional model over Spark at runtime. Let me walk you through a couple of things I have tried and explain the problem with each.

NOTE: Model is built using Python + Keras

python version: 3.8.0
keras version: 2.4.0
tensorflow version: 2.4.0

NOTE: Model is saved as a whole in H5 format

model.save('model.h5')

  1. Using the latest available release artifacts: 1.0.0-beta7

     <dependency>
         <groupId>org.deeplearning4j</groupId>
         <artifactId>deeplearning4j-core</artifactId>
         <version>1.0.0-beta7</version>
     </dependency>
     <dependency>
         <groupId>org.deeplearning4j</groupId>
         <artifactId>deeplearning4j-modelimport</artifactId>
         <version>1.0.0-beta7</version>
     </dependency>
     <dependency>
         <groupId>org.nd4j</groupId>
         <artifactId>nd4j-native</artifactId>
         <version>1.0.0-beta7</version>
     </dependency>
    

Model import using:

val model = KerasModelImport.importKerasModelAndWeights("model.h5", false)

But 1.0.0-beta7 fails to import the Functional model, with the following error:

org.deeplearning4j.nn.modelimport.keras.exceptions.InvalidKerasConfigurationException: Expected model class name Model (found Functional). For more information, see Overview - Deeplearning4j

At this point I shifted over to the latest artifacts available in the SNAPSHOT repos, i.e. https://oss.sonatype.org/content/repositories/snapshots

  2. Using the latest SNAPSHOT artifacts: 1.0.0-SNAPSHOT

     <dependency>
         <groupId>org.deeplearning4j</groupId>
         <artifactId>deeplearning4j-core</artifactId>
         <version>1.0.0-SNAPSHOT</version>
     </dependency>
     <dependency>
         <groupId>org.deeplearning4j</groupId>
         <artifactId>deeplearning4j-modelimport</artifactId>
         <version>1.0.0-SNAPSHOT</version>
     </dependency>
     <dependency>
         <groupId>org.nd4j</groupId>
         <artifactId>nd4j-native</artifactId>
         <version>1.0.0-SNAPSHOT</version>
     </dependency>
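
For reference, pulling SNAPSHOT artifacts like the above also requires the snapshots repository to be declared in the pom (the `id` below is an arbitrary label; the URL is the one mentioned earlier):

```xml
<repositories>
    <repository>
        <id>sonatype-snapshots</id>
        <url>https://oss.sonatype.org/content/repositories/snapshots</url>
        <snapshots>
            <enabled>true</enabled>
        </snapshots>
    </repository>
</repositories>
```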
    

This resolved the Keras Functional model import problem and gave me a fully working model import and model predict in the local environment. Great feeling! But another problem was encountered while deploying this model JAR to the runtime Spark environment:

Caused by: java.lang.NoClassDefFoundError: Could not initialize class org.nd4j.linalg.factory.Nd4j

The above is encountered at model import, i.e. the very first point of contact.

java.lang.NoClassDefFoundError: Could not initialize class org.nd4j.linalg.factory.Nd4j
at org.deeplearning4j.nn.modelimport.keras.Hdf5Archive.readDataSet(Hdf5Archive.java:295)
at org.deeplearning4j.nn.modelimport.keras.Hdf5Archive.readDataSet(Hdf5Archive.java:109)
at org.deeplearning4j.nn.modelimport.keras.utils.KerasModelUtils.importWeights(KerasModelUtils.java:284)
at org.deeplearning4j.nn.modelimport.keras.KerasModel.<init>(KerasModel.java:190)
at org.deeplearning4j.nn.modelimport.keras.KerasModel.<init>(KerasModel.java:99)
at org.deeplearning4j.nn.modelimport.keras.utils.KerasModelBuilder.buildModel(KerasModelBuilder.java:311)
at org.deeplearning4j.nn.modelimport.keras.KerasModelImport.importKerasModelAndWeights(KerasModelImport.java:150)
at com.example.Model$.apply(Model.scala:29)

Note, it was verified that both the JAR and the classpath contain the class in question. At this point I suspected that Spark seeks a platform-specific Nd4j implementation on the classpath, and hence attempted to include the nd4j-native-platform artifact in the JAR instead of the basic nd4j-native. Correct me if step 3 was an incorrect move.

  3. Using the nd4j-native-platform artifact instead of the basic nd4j-native

     <dependency>
         <groupId>org.deeplearning4j</groupId>
         <artifactId>deeplearning4j-core</artifactId>
         <version>1.0.0-SNAPSHOT</version>
     </dependency>
     <dependency>
         <groupId>org.deeplearning4j</groupId>
         <artifactId>deeplearning4j-modelimport</artifactId>
         <version>1.0.0-SNAPSHOT</version>
     </dependency>
     <dependency>
         <groupId>org.nd4j</groupId>
         <artifactId>nd4j-native-platform</artifactId>
         <version>1.0.0-SNAPSHOT</version>
     </dependency>
    

This fails to compile with the following:

[ERROR] Failed to execute goal on project apple: Could not resolve dependencies for project com.example.ls:apple:jar:2.3-SNAPSHOT: The following artifacts could not be resolved: org.nd4j:nd4j-native:jar:android-arm:1.0.0-SNAPSHOT, org.nd4j:nd4j-native:jar:android-arm64:1.0.0-SNAPSHOT, org.nd4j:nd4j-native:jar:android-x86:1.0.0-SNAPSHOT, org.nd4j:nd4j-native:jar:android-x86_64:1.0.0-SNAPSHOT, org.nd4j:nd4j-native:jar:linux-ppc64le:1.0.0-SNAPSHOT: Could not transfer artifact org.nd4j:nd4j-native:jar:android-arm:1.0.0-SNAPSHOT from/to maven-local-release : Failed to transfer file: http://artifactory.example.com:8000/artifactory/maven-local-release/org/nd4j/nd4j-native/1.0.0-SNAPSHOT/nd4j-native-1.0.0-SNAPSHOT-android-arm.jar. Return code is: 409 , ReasonPhrase:Conflict. → [Help 1]

Sorry for the long thread, but help with any of these outstanding roadblocks would be much appreciated.

PS: While I read in another post that a new beta or RC release is expected within a few weeks, it would be nice if the team could push the latest SNAPSHOT to the release repo with, say, beta or alpha tags to help address #1 and potentially #3. Working with SNAPSHOTs is like living on the edge: things can break tomorrow morning. :smile: Regardless, I would appreciate your guidance in this scenario.

Well, as of this morning it seems the SNAPSHOT is broken on a clean fetch, i.e. with the mvn -U flag.

java: error reading /Users/panchal/.m2/repository/org/nd4j/nd4j-native/1.0.0-SNAPSHOT/nd4j-native-1.0.0-20210327.020247-21287-macosx-x86_64.jar; zip file is empty

@panchal Could you clarify a bit? Index of /repositories/snapshots/org/nd4j/nd4j-native/1.0.0-SNAPSHOT
I see macosx jars in the snapshots right there.

Regarding your other point: yes, the platform artifact pulls in every platform. We still need to finish the ppc builds. For now, please use nd4j-native rather than nd4j-native-platform for snapshots, and it will pull in only your specific platform.
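
As an illustration of that suggestion (this is a sketch, not an official recipe): plain nd4j-native resolves the natives for the build machine. If the Spark executors run a different OS/arch than the build machine, one known option is to add the matching classifier jar explicitly instead of pulling in every platform via nd4j-native-platform. The classifier value below is an example; pick the one matching the cluster:

```xml
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-native</artifactId>
    <version>1.0.0-SNAPSHOT</version>
</dependency>
<!-- Example: also bundle the native libraries for the executors' platform -->
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-native</artifactId>
    <version>1.0.0-SNAPSHOT</version>
    <classifier>linux-x86_64</classifier>
</dependency>
```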

Hello @agibsonccc, yes, the JAR nd4j-native-1.0.0-20210327.020247-21287-macosx-x86_64.jar was being pulled fine but was 0 bytes in size, hence the above error. I haven't had a chance to update my local SNAPSHOT build since fetching that version.

Fortunately, I had a previous SNAPSHOT backed up, which let me replace the empty file.

Following up on my original queries and the issues I reported, here's how I worked around them.

Step 1: Import Functional Keras model using 1.0.0-SNAPSHOT build

KerasModelImport.importKerasModelAndWeights("model.h5", false)

Step 2: Save imported model as ComputationGraph using 1.0.0-SNAPSHOT build

model.save(new File(modelPath), false)

Step 3: Re-import saved ComputationGraph model instance using 1.0.0-beta7 build

ComputationGraph.load(new File(modelPath), false)

Switching to 1.0.0-beta7 also allowed me to include nd4j-native-platform for the runtime Spark environment, unblocking the runtime issues. I hope someone in a similar situation finds this helpful.
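
Put together, the workaround looks roughly like this (paths are placeholders; steps 1–2 run against the 1.0.0-SNAPSHOT artifacts and step 3 against 1.0.0-beta7 in a separate build, so this is a sketch rather than one runnable program):

```scala
import java.io.File
import org.deeplearning4j.nn.graph.ComputationGraph
import org.deeplearning4j.nn.modelimport.keras.KerasModelImport

// Steps 1 + 2: with 1.0.0-SNAPSHOT on the classpath, import the Keras
// Functional model and persist it in DL4J's native zip format.
val imported: ComputationGraph =
  KerasModelImport.importKerasModelAndWeights("model.h5", false)
imported.save(new File("model.dl4j.zip"), false) // saveUpdater = false

// Step 3: with 1.0.0-beta7 on the classpath (e.g. in the Spark job),
// re-load the already-converted ComputationGraph.
val model: ComputationGraph =
  ComputationGraph.load(new File("model.dl4j.zip"), false)
```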

PS: I eagerly await the next RC. :slight_smile:

@panchal Hmm, that didn't really answer my question. Sorry to be insistent, but could you try forcing a refresh or deleting the artifact? And yes, the release will be coming soon; we are still doing the final QA work.

That can happen when you have a connection issue. Things like that happen from time to time.

Just to give you an idea of how to do that: the -U Maven flag forces a snapshot refresh, so you can run something like mvn -U dependency:go-offline and it should re-download all of your snapshot dependencies.

You can also run mvn dependency:purge-local-repository, which removes the project's dependencies from your local repository; they will be re-downloaded on the next build.

@agibsonccc I confirm: a fresh attempt with the -U flag was successful, and the 0-byte JAR issue is no longer reproducible. @treo I am a bit skeptical that it was a connection problem, given that it lasted for several hours before I reported it, but at this point I won't doubt it any further. Thanks, guys, for looking into this.

My ISP had issues with the return path from AWS for several weeks, so uploads worked fine but downloads were extremely slow.

So a connection issue towards one specific endpoint on the internet can actually happen.

But, I’m happy that just refreshing snapshots worked for you :slight_smile: