RNN and FF getting mixed up in MergeVertex, and new FileStatsStorage not working

My MergeVertex is complaining that it is receiving layers of different types as input, when they should all be RNN: I am using a 1D CNN with an input vector of 483 values.

	ComputationGraphConfiguration.GraphBuilder graph = new NeuralNetConfiguration.Builder()
			.seed(seed)
			.activation(Activation.SWISH)
			.optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)
			.updater(new Adam(0.0003, 0.9, 0.999, 0.1))
			.weightInit(WeightInit.XAVIER)
			.miniBatch(true)
			.cacheMode(CacheMode.NONE)
			.trainingWorkspaceMode(WorkspaceMode.ENABLED)
			.inferenceWorkspaceMode(WorkspaceMode.ENABLED)
			.convolutionMode(ConvolutionMode.Causal)
			.graphBuilder();
	graph.setInputTypes(InputType.recurrent(483));
	//stem
	graph
		.addLayer("stem-cnn1",new Convolution1DLayer.Builder(3,2).nIn(483).nOut(32).build(),"input")

		.addLayer("stem-batch1", new BatchNormalization.Builder(false).decay(0.995).eps(0.001).nIn(32).nOut(32).build(),"stem-cnn1")

		.addLayer("stem-cnn2",new Convolution1DLayer.Builder(3).nIn(32).nOut(32).build(),"stem-batch1")

		.addLayer("stem-batch2",new BatchNormalization.Builder(false).decay(0.995).eps(0.001).nIn(32).nOut(32).build(),"stem-cnn2")

		.addLayer("stem-cnn3",new Convolution1DLayer.Builder(3).nIn(32).nOut(64).build(),"stem-batch2")

        .addLayer("stem-batch3", new BatchNormalization.Builder(false).decay(0.995).eps(0.001).nIn(64).nOut(64).build(), "stem-cnn3")
        //left branch
        .addLayer("stem-pool1",new Subsampling1DLayer.Builder(Subsampling1DLayer.PoolingType.MAX, 3, 2).build(),"stem-batch3")
        //right branch
        .addLayer("stem-cnn4",new Convolution1DLayer.Builder(3,2).nIn(64).nOut(96).build(),"stem-batch3")

        .addLayer("stem-batch4", new BatchNormalization.Builder(false).decay(0.995).eps(0.001).nIn(96).nOut(96).build(), "stem-cnn4")
        //merge
        .addVertex("concat1", new MergeVertex(),"stem-pool1", "stem-batch4")

the error>

Invalid input: MergeVertex cannot merge activations of different types: first type = RNN, input type 2 = FF
at org.deeplearning4j.nn.conf.graph.MergeVertex.getOutputType(MergeVertex.java:139)
at org.deeplearning4j.nn.conf.ComputationGraphConfiguration.getLayerActivationTypes(ComputationGraphConfiguration.java:537)
at org.deeplearning4j.nn.conf.ComputationGraphConfiguration.addPreProcessors(ComputationGraphConfiguration.java:450)
at org.deeplearning4j.nn.conf.ComputationGraphConfiguration$GraphBuilder.build(ComputationGraphConfiguration.java:1202)

For some reason BatchNormalization is reporting a feed-forward output type, which is messing things up. I assume BatchNormalization is supported for 1D CNNs? Is some preprocessor perhaps not getting triggered?

On an unrelated issue:

new FileStatsStorage(new File("/home/workspace/netStats/test1.dat")); to attach to the UI server is not working. It complains that the file does not exist, and if I create an empty text file with the same name it complains that it is not a valid MapDB database. So how do I create the file in the first place for saving the stats? new InMemoryStatsStorage(); works fine.
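For what it's worth, one thing I checked is whether the parent directory exists before constructing the storage: as far as I can tell, FileStatsStorage creates the MapDB .dat file itself on first use, but (my assumption) it will not create missing parent directories. A minimal sketch, with a temp-dir path used purely for illustration and the FileStatsStorage line left commented since it needs deeplearning4j-ui on the classpath:

```java
import java.io.File;

public class StatsFileSetup {
    public static void main(String[] args) {
        // Hypothetical location for illustration; substitute your own path.
        File statsFile = new File(System.getProperty("java.io.tmpdir"),
                "netStats/test1.dat");
        // Create the parent directory, not the file itself: FileStatsStorage
        // initialises the MapDB .dat file on first use, but (assumption) it
        // does not create missing parent directories, and a hand-made empty
        // file is not a valid MapDB database.
        statsFile.getParentFile().mkdirs();
        // StatsStorage statsStorage = new FileStatsStorage(statsFile);
        System.out.println(statsFile.getParentFile().isDirectory());
    }
}
```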

Thanks in advance

I have found out that the merge vertex works fine if the inputs are Convolution1DLayers and pooling layers; I can thus only apply BatchNormalization successfully AFTER the merge vertex, like so>

        //left branch
        .addLayer("stem-pool1",new Subsampling1DLayer.Builder(Subsampling1DLayer.PoolingType.MAX, 3, 2).build(),"stem-batch3")
        //right branch
        .addLayer("stem-cnn4",new Convolution1DLayer.Builder(3,2).nIn(64).nOut(96).build(),"stem-batch3")
        //merge
        .addVertex("concat1", new MergeVertex(),"stem-pool1", "stem-cnn4")
        .addLayer("stem-batch4", new BatchNormalization.Builder(false).decay(0.995).eps(0.001).build(), "concat1")

This is problematic because if I have a merge vertex with two or three Convolution1DLayers and a pooling layer, I can't apply BatchNormalization after each Convolution1DLayer as recommended, only once after the merge vertex. That may have unintended consequences for training which I have not yet run into, such as exploding gradients.

The strange thing is that if my merge vertex contains only BatchNormalization layers and no pooling layers, it works fine! It is the specific combination of pooling and BatchNormalization layers feeding a merge vertex that generates the error.

I am guessing this is a bug?

@MPdaedalus mind filing an issue with a stripped down example that reproduces the issue? Thanks!
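Sure. A stripped-down sketch along the lines of the config above, in case it is useful for the issue (untested as written here; toy layer sizes, same beta7 API and imports as my full config):

```java
import org.deeplearning4j.nn.conf.ComputationGraphConfiguration;
import org.deeplearning4j.nn.conf.ConvolutionMode;
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.deeplearning4j.nn.conf.graph.MergeVertex;
import org.deeplearning4j.nn.conf.inputs.InputType;
import org.deeplearning4j.nn.conf.layers.BatchNormalization;
import org.deeplearning4j.nn.conf.layers.Convolution1DLayer;
import org.deeplearning4j.nn.conf.layers.RnnOutputLayer;
import org.deeplearning4j.nn.conf.layers.Subsampling1DLayer;
import org.nd4j.linalg.activations.Activation;
import org.nd4j.linalg.lossfunctions.LossFunctions;

public class MergeVertexRepro {
    public static void main(String[] args) {
        // One 1D conv feeding a pooling branch and a conv+batch-norm branch,
        // then merged. In beta7 this reportedly fails at build() with:
        // "MergeVertex cannot merge activations of different types:
        //  first type = RNN, input type 2 = FF"
        ComputationGraphConfiguration conf = new NeuralNetConfiguration.Builder()
                .convolutionMode(ConvolutionMode.Causal)
                .graphBuilder()
                .addInputs("input")
                .setInputTypes(InputType.recurrent(483))
                .addLayer("cnn1", new Convolution1DLayer.Builder(3).nOut(8).build(), "input")
                // left branch: pooling only
                .addLayer("pool", new Subsampling1DLayer.Builder(
                        Subsampling1DLayer.PoolingType.MAX, 3, 2).build(), "cnn1")
                // right branch: conv followed by batch norm
                .addLayer("cnn2", new Convolution1DLayer.Builder(3, 2).nOut(8).build(), "cnn1")
                .addLayer("bn", new BatchNormalization.Builder().nOut(8).build(), "cnn2")
                .addVertex("merge", new MergeVertex(), "pool", "bn")
                .addLayer("out", new RnnOutputLayer.Builder(LossFunctions.LossFunction.MSE)
                        .activation(Activation.IDENTITY).nOut(2).build(), "merge")
                .setOutputs("out")
                .build(); // <- exception is thrown here
    }
}
```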

Thought I'd test it with the latest snapshot before filing a bug report, in case it has already been fixed, but javacpp is not playing ball with the snapshot. Which version of cuda-platform-redist do I need to use for the snapshot? It seems only CUDA 11.2 uses javacpp 1.5.5:

https://mvnrepository.com/artifact/org.bytedeco/cuda-platform-redist

but deeplearning4j only supports up to 10.2?

Warning: Versions of org.bytedeco:javacpp:1.5.5 and org.bytedeco:cuda:10.2-7.6-1.5.3 do not match.
23:31:02.142 [main] INFO org.nd4j.linalg.factory.Nd4jBackend - Loaded [JCublasBackend] backend
23:31:02.153 [main] ERROR org.nd4j.common.config.ND4JClassLoading - Cannot find class [org.nd4j.linalg.jblas.JblasBackend] of provided class-loader.

I don't know why it is referencing the old version I was using with beta7, as my POM references the new version and pulled in 3GB or so of new files, and I ran a Maven clean and force update.

Here are the important parts of my pom.xml:

	<dependency>
		<groupId>org.deeplearning4j</groupId>
		<artifactId>deeplearning4j-ui</artifactId>
		<version>${dl4j.version}</version>
	</dependency>
	<dependency>
		<groupId>org.eclipse.collections</groupId>
		<artifactId>eclipse-collections-api</artifactId>
		<version>10.4.0</version>
	</dependency>
	<dependency>
		<groupId>org.eclipse.collections</groupId>
		<artifactId>eclipse-collections</artifactId>
		<version>10.4.0</version>
	</dependency>
	<dependency>
		<groupId>org.nd4j</groupId>
		<artifactId>nd4j-cuda-10.2</artifactId>
		<version>${dl4j.version}</version>
	</dependency>
	<dependency>
		<groupId>org.deeplearning4j</groupId>
		<artifactId>deeplearning4j-core</artifactId>
		<version>${dl4j.version}</version>
	</dependency>
	<dependency>
		<groupId>org.deeplearning4j</groupId>
		<artifactId>deeplearning4j-cuda-10.2</artifactId>
		<version>${dl4j.version}</version>
	</dependency>
	<dependency>
		<groupId>org.bytedeco</groupId>
		<artifactId>cuda-platform-redist</artifactId>
		<version>11.2-8.1-1.5.5</version>
	</dependency>
	<dependency>
		<groupId>org.deeplearning4j</groupId>
		<artifactId>deeplearning4j-zoo</artifactId>
		<version>${dl4j.version}</version>
	</dependency>
	<dependency>
		<groupId>org.deeplearning4j</groupId>
		<artifactId>deeplearning4j-datavec-iterators</artifactId>
		<version>${dl4j.version}</version>
	</dependency>
	<dependency>
		<groupId>org.datavec</groupId>
		<artifactId>datavec-local</artifactId>
		<version>${dl4j.version}</version>
	</dependency>
</dependencies>

using <dl4j.version>1.0.0-SNAPSHOT</dl4j.version>
<nd4j.version>1.0.0-SNAPSHOT</nd4j.version>

For CUDA 10.2, these instructions here are up-to-date:
https://deeplearning4j.konduit.ai/config/backends/config-cudnn
You can ignore the warning about the versions, that’s not a problem.

OK, the bug is fixed in the snapshot: I can mix BatchNormalization and pooling with no problem, and the model compiles fine with the snapshot but not with beta7. Hopefully the bug is also fixed in beta8, due any time now?

There is either still a problem with my CUDA backend (it works fine in beta7) or something else is going wrong, as I'm getting a java.lang.UnsupportedOperationException when Nd4j.create is called with my training data, or earlier when using:

model.setListeners(new StatsListener(statsStorage), new ScoreIterationListener(10));

This does not occur if I switch back to beta7 or use the CPU backend; my pom.xml is the same as in the previous post.

Warning: Versions of org.bytedeco:javacpp:1.5.5 and org.bytedeco:cuda:10.2-7.6-1.5.3 do not match.
23:43:43.788 [main] INFO org.nd4j.linalg.factory.Nd4jBackend - Loaded [JCublasBackend] backend
23:43:43.798 [main] ERROR org.nd4j.common.config.ND4JClassLoading - Cannot find class [org.nd4j.linalg.jblas.JblasBackend] of provided class-loader.
23:43:43.799 [main] ERROR org.nd4j.common.config.ND4JClassLoading - Cannot find class [org.canova.api.io.data.DoubleWritable] of provided class-loader.
23:43:43.855 [main] ERROR org.nd4j.common.config.ND4JClassLoading - Cannot find class [org.nd4j.linalg.jblas.JblasBackend] of provided class-loader.
23:43:43.856 [main] ERROR org.nd4j.common.config.ND4JClassLoading - Cannot find class [org.canova.api.io.data.DoubleWritable] of provided class-loader.
23:43:45.711 [main] INFO org.nd4j.nativeblas.NativeOpsHolder - Number of threads used for linear algebra: 32
23:43:45.740 [main] INFO org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner - Backend used: [CUDA]; OS: [Linux]
23:43:45.740 [main] INFO org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner - Cores: [4]; Memory: [8.0GB];
23:43:45.741 [main] INFO org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner - Blas vendor: [CUBLAS]
23:43:45.750 [main] INFO org.nd4j.linalg.jcublas.JCublasBackend - ND4J CUDA build version: 10.2.89
23:43:45.752 [main] INFO org.nd4j.linalg.jcublas.JCublasBackend - CUDA device 0: [GeForce GTX 1060 6GB]; cc: [6.1]; Total memory: [6373179392]
Exception in thread "main" java.lang.UnsupportedOperationException
at org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner.createShapeInfo(DefaultOpExecutioner.java:945)
at org.nd4j.linalg.api.shape.Shape.createShapeInformation(Shape.java:3279)
at org.nd4j.linalg.api.ndarray.BaseShapeInfoProvider.createShapeInformation(BaseShapeInfoProvider.java:75)
at org.nd4j.jita.constant.ProtectedCudaShapeInfoProvider.createShapeInformation(ProtectedCudaShapeInfoProvider.java:92)
at org.nd4j.jita.constant.ProtectedCudaShapeInfoProvider.createShapeInformation(ProtectedCudaShapeInfoProvider.java:73)
at org.nd4j.linalg.jcublas.CachedShapeInfoProvider.createShapeInformation(CachedShapeInfoProvider.java:42)
at org.nd4j.linalg.api.ndarray.BaseNDArray.&lt;init&gt;(BaseNDArray.java:166)
at org.nd4j.linalg.api.ndarray.BaseNDArray.&lt;init&gt;(BaseNDArray.java:234)
at org.nd4j.linalg.api.ndarray.BaseNDArray.&lt;init&gt;(BaseNDArray.java:225)
at org.nd4j.linalg.jcublas.JCublasNDArray.&lt;init&gt;(JCublasNDArray.java:72)
at org.nd4j.linalg.jcublas.JCublasNDArrayFactory.create(JCublasNDArrayFactory.java:151)
at org.nd4j.linalg.factory.Nd4j.create(Nd4j.java:3445)

Just for clarification: if I'm using cuda-platform-redist, I don't need CUDA installed on my Linux system, just the X11 CUDA drivers?

@MPdaedalus hm, it should be hitting this method, not the super method:

Yes, it's coming soon, but no ETA yet. Ideally end of month at the latest. I'm still auditing the tests (we've had a bit of technical debt accrue that I'm currently cleaning up).

OK, the error has now changed with the latest snapshot, but still no luck:

23:20:09.587 [main] INFO org.nd4j.linalg.factory.Nd4jBackend - Loaded [JCublasBackend] backend
23:20:09.596 [main] ERROR org.nd4j.common.config.ND4JClassLoading - Cannot find class [org.nd4j.linalg.jblas.JblasBackend] of provided class-loader.
23:20:09.597 [main] ERROR org.nd4j.common.config.ND4JClassLoading - Cannot find class [org.canova.api.io.data.DoubleWritable] of provided class-loader.
23:20:09.650 [main] ERROR org.nd4j.common.config.ND4JClassLoading - Cannot find class [org.nd4j.linalg.jblas.JblasBackend] of provided class-loader.
23:20:09.651 [main] ERROR org.nd4j.common.config.ND4JClassLoading - Cannot find class [org.canova.api.io.data.DoubleWritable] of provided class-loader.
Exception in thread "main" java.lang.ExceptionInInitializerError
at org.nd4j.jita.concurrency.CudaAffinityManager.getNumberOfDevices(CudaAffinityManager.java:132)
at org.nd4j.jita.constant.ConstantProtector.purgeProtector(ConstantProtector.java:56)
at org.nd4j.jita.constant.ConstantProtector.&lt;init&gt;(ConstantProtector.java:49)
at org.nd4j.jita.constant.ConstantProtector.&lt;clinit&gt;(ConstantProtector.java:37)
at org.nd4j.jita.constant.ProtectedCudaConstantHandler.&lt;clinit&gt;(ProtectedCudaConstantHandler.java:65)
at org.nd4j.jita.constant.CudaConstantHandler.&lt;clinit&gt;(CudaConstantHandler.java:34)
at java.base/java.lang.Class.forName0(Native Method)
at java.base/java.lang.Class.forName(Class.java:468)
at org.nd4j.common.config.ND4JClassLoading.loadClassByName(ND4JClassLoading.java:62)
at org.nd4j.common.config.ND4JClassLoading.loadClassByName(ND4JClassLoading.java:56)
at org.nd4j.linalg.factory.Nd4j.initWithBackend(Nd4j.java:5152)
at org.nd4j.linalg.factory.Nd4j.initContext(Nd4j.java:5093)
at org.nd4j.linalg.factory.Nd4j.&lt;clinit&gt;(Nd4j.java:270)
at org.nd4j.linalg.dataset.DataSet.&lt;init&gt;(DataSet.java:111)
at org.nd4j.linalg.dataset.DataSet.&lt;init&gt;(DataSet.java:94)
at org.nd4j.linalg.dataset.DataSet.&lt;init&gt;(DataSet.java:67)
at main.PatternDetect.getTrainingData(PatternDetect.java:145)
at main.PatternDetect.run(PatternDetect.java:60)
at main.Start.main(Start.java:15)
Caused by: java.lang.RuntimeException: ND4J is probably missing dependencies. For more information, please refer to: https://deeplearning4j.konduit.ai/nd4j/backend
at org.nd4j.nativeblas.NativeOpsHolder.&lt;init&gt;(NativeOpsHolder.java:116)
at org.nd4j.nativeblas.NativeOpsHolder.&lt;clinit&gt;(NativeOpsHolder.java:37)
… 19 more
Caused by: java.lang.UnsatisfiedLinkError: no jnind4jcuda in java.library.path: /usr/local/cuda/lib64:/usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib
at java.base/java.lang.ClassLoader.loadLibrary(ClassLoader.java:2447)
at java.base/java.lang.Runtime.loadLibrary0(Runtime.java:809)
at java.base/java.lang.System.loadLibrary(System.java:1893)
at org.bytedeco.javacpp.Loader.loadLibrary(Loader.java:1631)
at org.bytedeco.javacpp.Loader.load(Loader.java:1265)
at org.bytedeco.javacpp.Loader.load(Loader.java:1109)
at org.nd4j.nativeblas.Nd4jCuda.&lt;clinit&gt;(Nd4jCuda.java:10)
at java.base/java.lang.Class.forName0(Native Method)
at java.base/java.lang.Class.forName(Class.java:468)
at org.nd4j.common.config.ND4JClassLoading.loadClassByName(ND4JClassLoading.java:62)
at org.nd4j.common.config.ND4JClassLoading.loadClassByName(ND4JClassLoading.java:56)
at org.nd4j.nativeblas.NativeOpsHolder.&lt;init&gt;(NativeOpsHolder.java:88)
… 20 more
Caused by: java.lang.UnsatisfiedLinkError: no nd4jcuda in java.library.path: /usr/local/cuda/lib64:/usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib
at java.base/java.lang.ClassLoader.loadLibrary(ClassLoader.java:2447)
at java.base/java.lang.Runtime.loadLibrary0(Runtime.java:809)
at java.base/java.lang.System.loadLibrary(System.java:1893)
at org.bytedeco.javacpp.Loader.loadLibrary(Loader.java:1631)
at org.bytedeco.javacpp.Loader.load(Loader.java:1213)
… 27 more

I don’t understand why it is looking for the native libs on my machine when I have cuda-platform-redist in my pom.xml

	<dependency>
		<groupId>org.bytedeco</groupId>
		<artifactId>cuda-platform-redist</artifactId>
		<version>10.2-7.6-1.5.3</version>
	</dependency>
	<dependency>
		<groupId>org.deeplearning4j</groupId>
		<artifactId>deeplearning4j-cuda-10.2</artifactId>
		<version>${dl4j.version}</version>
	</dependency>
	<dependency>
		<groupId>org.nd4j</groupId>
		<artifactId>nd4j-cuda-10.2</artifactId>
		<version>${dl4j.version}</version>
	</dependency>

I should add that I have installed CUDA 10.2 and cuDNN 7.6.5 and made sure libcudnn.so.7.6.5 is in /usr/local/cuda/lib64, but it makes no difference.

Have the snapshots been tested to work correctly in Eclipse, not just the JetBrains IDE? I'm running out of ideas. Or is the snapshot just so unstable that it can't be used?

Further investigation has found that if I use the "-platform" artifact instead of just "nd4j-cuda-10.2", as mentioned in the docs, the error in the previous post disappears and I am back to the earlier UnsupportedOperationException in createShapeInfo:

Exception in thread "main" java.lang.UnsupportedOperationException
at org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner.createShapeInfo(DefaultOpExecutioner.java:945)

This error also comes up for everything I run, such as the LinearDataClassifier and IrisClassifier examples. It also comes up regardless of whether I am using cuda-platform-redist or relying on the native binaries in my /usr/local/cuda/lib64 folder, which seems to indicate it is a problem with the snapshot and not with my CUDA backend (as I said before, the CUDA backend works fine in beta7), so I'm going to file a bug report for the error.

	<dependency>
		<groupId>org.nd4j</groupId>
		<artifactId>nd4j-cuda-10.2-platform</artifactId>
		<version>${dl4j.version}</version>
	</dependency>

However, there seem to be builds missing for certain OSes, which causes Maven and Eclipse to complain with the platform release. I don't use them, but they are needed for a proper build.

Missing artifact org.nd4j:nd4j-cuda-10.2:jar:linux-ppc64le:1.0.0-SNAPSHOT
Missing artifact org.nd4j:nd4j-cuda-10.2:jar:windows-x86_64:1.0.0-SNAPSHOT
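If the missing classifiers are the only blocker, one possible workaround (an assumption on my part, based on how JavaCPP's -platform artifacts select their classifiers) is to pin the platform so Maven only resolves the artifact for the current OS:

```xml
<!-- Restrict JavaCPP -platform artifacts (including nd4j-cuda-10.2-platform)
     to a single OS/arch, so Maven does not try to resolve snapshot classifiers
     that were never published. Can also be passed on the command line as
     -Djavacpp.platform=linux-x86_64 -->
<properties>
    <javacpp.platform>linux-x86_64</javacpp.platform>
</properties>
```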

The problem is fixed in the CUDA 11.2 snapshot.