I am using dl4j-cuda version 1.0.0-M1.1 and I am running my application on an HPC cluster with CUDA 11.1. I get this error:
```
Caused by: java.lang.UnsatisfiedLinkError: /home/h4/nore667e/.javacpp/cache/deepLearningSimpleOne-1.0-SNAPSHOT-jar-with-dependencies.jar/org/nd4j/nativeblas/linux-x86_64/libjnind4jcuda.so: /lib64/libm.so.6: version `GLIBC_2.23' not found (required by /home/h4/nore667e/.javacpp/cache/deepLearningSimpleOne-1.0-SNAPSHOT-jar-with-dependencies.jar/org/nd4j/nativeblas/linux-x86_64/libnd4jcuda.so)
```
I tried to install GLIBC 2.23 but it didn't work. My dependencies are:
```
<dependency>
    <groupId>org.bytedeco</groupId>
    <artifactId>cuda-platform-redist</artifactId>
    <version>11.2-8.1-1.5.5</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-cuda-11.2</artifactId>
    <version>1.0.0-M1.1</version>
    <classifier>linux-x86_64-cudnn</classifier>
</dependency>
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>${nd4j.backend}</artifactId>
    <version>${dl4j-master.version}</version>
</dependency>
<dependency>
    <groupId>org.deeplearning4j</groupId>
    <artifactId>deeplearning4j-cuda-11.2</artifactId>
    <version>1.0.0-M1.1</version>
</dependency>
```
I have been trying to fix this for a while but didn't succeed. Does anyone know how to fix it?
Thank you!
What OS are you running on for the workers? You need a newer glibc in order to run this. If you are running on a very old system you will not be able to run your workload. I can try to help you figure out a better way to do this but will need more to work with.
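If you're not sure what the nodes actually have, a quick way to check is something like this (the paths assume a standard x86_64 layout, so adjust if yours differs):
```
# Print the glibc version the system ships with (stock CentOS 7 has 2.17).
ldd --version | head -n 1

# List the GLIBC symbol versions exported by the system libm;
# GLIBC_2.23 will be missing on a stock CentOS 7 install.
strings /lib64/libm.so.6 | grep '^GLIBC_' | sort -u
```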
These are the details of my OS:
```
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
```
Thank you so much for your reply. That's unfortunate, because I already installed GLIBC 2.23 but I still get the same error.
I also wanted to know whether GPU/CUDA is supported in version beta7, so that I can switch back to beta7, since I didn't have this problem with that version.
@Nour-Rekik it is, but not with that cuda version. That version is also very old and won't be as performant. Could you give me a full stack trace? If you still get the same error, it means your newer glibc isn't being used. You may not have installed it on the workers where this is running. Please double check how you did that.
```
Exception in thread "main" java.lang.ExceptionInInitializerError
at org.nd4j.jita.concurrency.CudaAffinityManager.getNumberOfDevices(CudaAffinityManager.java:136)
at org.nd4j.jita.constant.ConstantProtector.purgeProtector(ConstantProtector.java:60)
at org.nd4j.jita.constant.ConstantProtector.<init>(ConstantProtector.java:53)
at org.nd4j.jita.constant.ConstantProtector.<clinit>(ConstantProtector.java:41)
at org.nd4j.jita.constant.ProtectedCudaConstantHandler.<clinit>(ProtectedCudaConstantHandler.java:69)
at org.nd4j.jita.constant.CudaConstantHandler.<clinit>(CudaConstantHandler.java:38)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.nd4j.common.config.ND4JClassLoading.loadClassByName(ND4JClassLoading.java:62)
at org.nd4j.common.config.ND4JClassLoading.loadClassByName(ND4JClassLoading.java:56)
at org.nd4j.linalg.factory.Nd4j.initWithBackend(Nd4j.java:5152)
at org.nd4j.linalg.factory.Nd4j.initContext(Nd4j.java:5093)
at org.nd4j.linalg.factory.Nd4j.<clinit>(Nd4j.java:270)
at org.datavec.image.loader.NativeImageLoader.transformImage(NativeImageLoader.java:670)
at org.datavec.image.loader.NativeImageLoader.asMatrix(NativeImageLoader.java:593)
at org.datavec.image.loader.NativeImageLoader.asMatrix(NativeImageLoader.java:281)
at org.datavec.image.loader.NativeImageLoader.asMatrix(NativeImageLoader.java:256)
at org.datavec.image.loader.NativeImageLoader.asMatrix(NativeImageLoader.java:250)
at org.datavec.image.recordreader.BaseImageRecordReader.next(BaseImageRecordReader.java:247)
at org.datavec.image.recordreader.BaseImageRecordReader.nextRecord(BaseImageRecordReader.java:511)
at org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator.initializeUnderlying(RecordReaderDataSetIterator.java:194)
at org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator.next(RecordReaderDataSetIterator.java:341)
at org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator.next(RecordReaderDataSetIterator.java:421)
at org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator.next(RecordReaderDataSetIterator.java:53)
at com.examples.DeepLearningOnSpark.imageNet_image.streaming.NetworkRetrainingMain.entryPoint(NetworkRetrainingMain.java:55)
at com.examples.DeepLearningOnSpark.imageNet_image.streaming.NetworkRetrainingMain.main(NetworkRetrainingMain.java:31)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.RuntimeException: ND4J is probably missing dependencies. For more information, please refer to: https://deeplearning4j.konduit.ai/nd4j/backend
at org.nd4j.nativeblas.NativeOpsHolder.<init>(NativeOpsHolder.java:116)
at org.nd4j.nativeblas.NativeOpsHolder.<clinit>(NativeOpsHolder.java:37)
... 38 more
Caused by: java.lang.UnsatisfiedLinkError: no jnind4jcuda in java.library.path
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1867)
at java.lang.Runtime.loadLibrary0(Runtime.java:870)
at java.lang.System.loadLibrary(System.java:1122)
at org.bytedeco.javacpp.Loader.loadLibrary(Loader.java:1718)
at org.bytedeco.javacpp.Loader.load(Loader.java:1328)
at org.bytedeco.javacpp.Loader.load(Loader.java:1132)
at org.nd4j.nativeblas.Nd4jCuda.<clinit>(Nd4jCuda.java:10)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.nd4j.common.config.ND4JClassLoading.loadClassByName(ND4JClassLoading.java:62)
at org.nd4j.common.config.ND4JClassLoading.loadClassByName(ND4JClassLoading.java:56)
at org.nd4j.nativeblas.NativeOpsHolder.<init>(NativeOpsHolder.java:88)
... 39 more
Caused by: java.lang.UnsatisfiedLinkError: /home/h4/nore667e/.javacpp/cache/deepLearningSimpleOne-1.0-SNAPSHOT-jar-with-dependencies.jar/org/nd4j/nativeblas/linux-x86_64/libjnind4jcuda.so: /lib64/libm.so.6: version `GLIBC_2.23' not found (required by /home/h4/nore667e/.javacpp/cache/deepLearningSimpleOne-1.0-SNAPSHOT-jar-with-dependencies.jar/org/nd4j/nativeblas/linux-x86_64/libnd4jcuda.so)
at java.lang.ClassLoader$NativeLibrary.load(Native Method)
at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1941)
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1824)
at java.lang.Runtime.load0(Runtime.java:809)
at java.lang.System.load(System.java:1086)
at org.bytedeco.javacpp.Loader.loadLibrary(Loader.java:1668)
```
Since I am using Spark, I did the configuration with `source framework-configure.sh spark $SPARK_HOME/conf`, and in `spark-env.sh` I added `export LD_LIBRARY_PATH=/scratch/ws/1/s4122485-glibc/lib_new/:$LD_LIBRARY_PATH`, then continued with `start-all.sh`.
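Concretely, the only thing I added to `spark-env.sh` is this line (the glibc path is my scratch workspace on the cluster; as far as I understand, the same file and directory would also need to be present on every worker node):
```
# $SPARK_HOME/conf/spark-env.sh
# Prepend the locally built glibc to the library search path used by the Spark daemons.
# Note: every worker node needs this file (and the glibc directory), not just the driver.
export LD_LIBRARY_PATH=/scratch/ws/1/s4122485-glibc/lib_new/:$LD_LIBRARY_PATH
```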
Hello, I'm not an official DL4J person, but I too had a lot of trouble running on CentOS 7 because of library versions. I couldn't upgrade some things to the minimum versions DL4J requires because doing so broke other things.
My ultimate solution was to recompile DL4J on my target architecture. It was a lot of work to figure out how to do it, and it takes a long time, but it worked. I had to make changes to the DL4J build scripts to suit CentOS 7 as well.
As a rough outline, to compile DL4J I:
1. Installed the devtoolset kit. I think I went with version 10: `yum install devtoolset-10*`
2. Downloaded, built, and installed the source for a newer version of CMake. I got it from the Kitware/CMake GitHub repository.
3. Ran the Maven build… I don't have the commands handy as I turned this all into an Ansible script, but I based it off the directions here and it was something like `mvn clean package -DskipTests -Pcpu -Pcuda -Djavapp.cpp=x86_64…` If you want, I can find the exact commands.
@chris2 of note, you don't actually need openblas when building for cuda.
Regarding your use case here: at least for cpu, note that we do have the -compat classifier, which is built against very old glibcs. I would suggest you try that, at least for cpu.
Do you have a use case where you were running something like an older glibc but using a newer cuda? Do people actually do that? Usually nvcc (the cuda compiler) is hard coded to specific cuda versions. I’m not sure how that lines up with older glibcs.
We're working on making the modular c++ code base and backend architecture a bit easier for people to deal with, while hopefully giving people the knobs to control the things they want, like binary size, glibc version, cuda version, platform, etc.
If you could elaborate a bit, it would help with potentially automating some of the pain points here. We already publish a ton of different classifiers for about every conceivable combination people usually want, but I guess that itself still has limitations.
I understand from the phrase "building from source" that I should run the Maven command `mvn clean package`, which will create the jar-with-dependencies file, and then submit this jar-with-dependencies file using Spark, right?
@Nour-Rekik no, building from source means manually downloading the code from the deeplearning4j repo and building it yourself.
This does not mean building your code with maven.
This means using c++ compilers, java build tools and using 1.0.0-SNAPSHOT in your version.
In order to do this you will need to figure out what software is already installed on your cluster like the nvcc version, gcc version and the like.
Please do let me know that and follow this guide:
The reason you need to do that is because the source code does support older versions but you have to build it manually. It’s hard to support more than 2 cuda versions for each release. In our open source code and binaries we only support the 2 most recent versions, but you can build the code to match the version you want.
Beyond that feel free to ask for help with that here.
Thank you for the explanation, but I didn't find clear steps to follow in the guide for building from source. Also, I don't know whether it's feasible on an HPC cluster. I am sorry, but I still don't get how to do it.
On the HPC cluster I have:
`gcc --version` → 8.3.0
and these are the CUDA versions available on the cluster:
Establish what cuda you have installed. From here I can see cuda 11.4. That’s a good match for something we support. In your case you’ll want to follow the building for cuda section and change it to cuda 11.4.
For the command you’ll be using maven to install the build. Maven also controls the build process for the c++ library that uses cuda underneath.
When you initiate the build process you will want to ensure NVCC is on your path.
Once you ensure that the right cuda folder is first on your path and it finds the right nvcc you can start to build your project.
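On most HPC clusters cuda is exposed through environment modules, so a quick way to check what's available and which nvcc you'd actually get is something like this (module names vary per cluster, so treat it as a sketch):
```
# List the CUDA installations the cluster exposes (if it uses environment modules).
module avail cuda

# Load the one you want to build against (the exact module name varies per cluster).
module load CUDA/11.4

# Confirm which nvcc is now first on the PATH and which version it reports.
which nvcc
nvcc --version
```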
You'll need a relevant java version as well. In your case you'll need a JDK. I would suggest Azul's builds (from the Azul Java download page). You'll want to download one for centos 7.
Extract the binary from there and set your JAVA_HOME.
Maven needs that in order to use the relevant java compiler.
So in summary:
1. Clone dl4j
2. Download maven and a jdk
3. Set up the nvcc and mvn executables on your path
4. Run ./change-cuda-versions.sh in the root dl4j directory (a rough sketch of the whole sequence is below)
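To make those steps concrete, here's a minimal sketch of the sequence on the login node. The tool paths below are placeholders for wherever you actually unpack the downloads, and it assumes you already loaded the cuda module so nvcc is on the path:
```
# 1. Clone the deeplearning4j monorepo.
git clone https://github.com/deeplearning4j/deeplearning4j.git
cd deeplearning4j

# 2. Point JAVA_HOME at the JDK you extracted and put maven on the PATH.
#    Both paths are placeholders for wherever you unpacked the downloads.
export JAVA_HOME=$HOME/tools/zulu8-jdk
export PATH=$JAVA_HOME/bin:$HOME/tools/apache-maven/bin:$PATH

# 3. Sanity checks: the build uses whichever java/mvn/nvcc are first on the PATH
#    (nvcc comes from the cuda module you loaded earlier).
java -version
mvn -version
nvcc --version

# 4. Switch the build files to the cuda version installed on the cluster.
./change-cuda-versions.sh 11.4
```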
I cloned dl4j and changed the cuda version to 11.0 using ./change-cuda-versions.sh (I didn't use 11.4 because I don't know which gcc version goes with it).
Then I used this command in the root dl4j directory:
but it does contain some other comments around this topic in case you're wondering. I would recommend following the general spirit of this guide, but with the most recent cmake, 3.24. You may need other things in order for this to work.
Note you only need to do this once.
Ensure cmake is in your path similar to the other steps above. This also shouldn’t need admin privileges.
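For example, one way to get a recent cmake without admin rights is to unpack a prebuilt release into your home directory; the exact archive name depends on the 3.24.x release you pick from cmake.org, so treat this as a sketch:
```
# Download a prebuilt cmake release and unpack it somewhere you can write to
# (the version/file name is an example; pick a current 3.24.x release).
cd $HOME/tools
wget https://github.com/Kitware/CMake/releases/download/v3.24.2/cmake-3.24.2-linux-x86_64.tar.gz
tar xzf cmake-3.24.2-linux-x86_64.tar.gz

# Put it first on the PATH so the dl4j build picks it up instead of the system cmake.
export PATH=$HOME/tools/cmake-3.24.2-linux-x86_64/bin:$PATH
cmake --version
```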
Try to read the errors. If you look right above the error there, you'll see "cmake not found", which is how I was able to tell you what to do next.