Optimization question

Hi! I've noticed that quite a lot of things in dl4j, nd4j and SameDiff are not optimized, for example the .cu files and memory management. I think I could fix some of that; I've spent a lot of time debugging and optimizing applications. I have a list of suggestions, fixes and optimizations I could write up. It would take me a few hours to write the explanations, so I'm asking first whether you are interested.

These changes could increase compute throughput (possibly up to 20 times or more), decrease memory consumption, and enable new approaches to data analysis. This is not the final list of features. Let me know if you want me to explain any of it.

Sure, just send us pull requests! We’re always open to those.

Thank you for your answer. It will take a while to write up.

@jijiji please feel free to coordinate with us either here or via GitHub issues on what you want to optimize. There are likely areas of the code base that we're already thinking of optimizing, parts that we might deprecate, or things that already exist in the C++ part of the code and may supplant work done at the Java level.

I've posted a list. It isn't complete because it's big, and I may update it later. You can check it.

Where, though? I don't see anything in your post or in the GitHub issues.

@jijiji I guess I was looking for something more specific. What are you targeting for optimization? The CMake build? CUDA code in libnd4j? Workspaces + Java interaction?

If you have something in mind, it would be easier to see whether it lines up with priorities we already know need work.

It's a complex optimization affecting everything you listed. You could ask me about a specific item and I could explain it. Items 4, 5 and 6 are more long term, but items 1, 2 and 3 would give a lot of optimization. Ask about something specific and I will explain it.

I guess let's just aim for something small and concrete that would allow a quick win; that way we actually ship something. Longer-term things (especially on an open source volunteer basis) sometimes never get merged, due to either time or quality constraints. Could you outline what your first target is? Generally, smaller pull requests that are easy to review and test are better than large, monolithic ones.

Items 2, 3, maybe 7, and partially 4 could be written pretty fast. Item 2 is the longer one.

Here is an example. I used to cache data sets, but I was getting an out of memory exception, and host workspace allocation wouldn't work. After I switched to caching the data in CPU memory and then recreating the CUDA arrays, it worked, and I got around a 20% training speed boost.

This is an example of moving a CUDA array into a CPU-backed array:

// remember the current workspace and switch to the host-only cache workspace
final MemoryWorkspace currentWorkspace = Nd4j.getMemoryManager().getCurrentWorkspace();
Nd4j.getMemoryManager().setCurrentWorkspace(cacheWorkspace);
cacheWorkspace.notifyScopeBorrowed();
if (!array.dataType().equals(DataType.BYTE))
{
	throw new UnsupportedOperationException();
}

// copy the raw bytes into a buffer allocated inside the cache workspace
final byte[] asBytes = array.data().asBytes();
final Int8Buffer buffer = new Int8Buffer(asBytes.length, true, cacheWorkspace);
buffer.setData(asBytes);
final NDArray leverageTo = new NDArray(
	buffer,
	array.shape(),
	array.stride(),
	array.offset(),
	array.ordering(),
	DataType.BYTE);

// cache the host-backed copy and restore the previous workspace
cached.put(file, leverageTo);
cacheWorkspace.notifyScopeLeft();
Nd4j.getMemoryManager().setCurrentWorkspace(currentWorkspace);
if (leverageTo.wasClosed()) //|| Nd4j.getWorkspaceManager().getWorkspaceForCurrentThread().getGenerationId() != leverageTo.data().getGenerationId())
{
	throw new IllegalStateException();
}

where

	cacheWorkspace = new CpuWorkspace(
		WorkspaceConfiguration
			.builder()
			.policyAllocation(AllocationPolicy.STRICT)
			.policyLocation(LocationPolicy.RAM)
			.policyLearning(LearningPolicy.NONE)
			.policyReset(ResetPolicy.BLOCK_LEFT)
			.policyMirroring(MirroringPolicy.HOST_ONLY)
			.policySpill(SpillPolicy.FAIL)
			.initialSize(4L * 1024L * 1024L * 1024L)
			.minSize(4L * 1024L * 1024L * 1024L)
			.maxSize(4L * 1024L * 1024L * 1024L)
			.build(),
		id);
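
For the reverse direction (not shown here), reading the cached host copy back into a normal device-backed array before training could look roughly like this. It's only a sketch; hostCopy and restored are placeholder names, and it assumes dup() is called outside the cache workspace so the copy lands in the regular (device) context:

final INDArray hostCopy = cached.get(file);
// dup() outside the cache workspace allocates a fresh buffer in the current context,
// so the result can be used by CUDA ops again
final INDArray restored = hostCopy.dup();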

@jijiji we have a lot of that work done in C++ actually. I would be all for improvements to the workspaces docs if you want to help write an example of how to optimize workspace usage as it stands.

Sure. I will write some code.

It would take me a few weeks to implement.

@jijiji no hurry. Also, if it ends up being too much, that's also fine. That's why I suggested a scoped-down project first.

Is there a way to disable keeping the mask and activations of each layer during neural network training? When I add scale/shift/merge vertices they consume a lot of RAM; I would rather recompute them a second time during backprop. Maybe there is something like a PipelineLayer so I can hide its internal layers?

And how can convolutions like input [1, 1, 1000, 1000], kernel [2, 2], nOut 1 be sped up? Would it be faster to reshape the input to a cnn3d shape like [1, 1, 100, 100, 100], or to move some of it into additional channels or the batch dimension?

My guess is that small convolutions with a larger channel count run faster than high-resolution single-channel convolutions of the same total size.
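
Just to make the reshape idea concrete, here is a rough sketch (placeholder variable names, same Nd4j imports as in the test code below). Note that a plain reshape only changes the shape for a performance comparison; it scrambles spatial neighborhoods, so it is not an equivalent computation:

// [batch, channels, height, width] input for the 2D case
final INDArray img2d = Nd4j.ones(1, 1, 1000, 1000);
// the same 1,000,000 values viewed as [batch, channels, depth, height, width] for a cnn3d layer
final INDArray img3d = img2d.reshape(1, 1, 100, 100, 100);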

Here are some preliminary tests. I got tired, so I can't fix some of the errors right now.

There are some weird performance results here. Some layers should be faster than others. I have no explanation yet, only raw data to check.

Take a look at "16 batch size 16 channels". The normal convolution took 716 ms, but the simulated compact cnn2d 16 channels test with shape [256, 1, 28, 28] took over 5 seconds, even though it does 16 * 16 times fewer calculations because there is one filter per channel. So in theory it should take about 3 ms (716 ms / 256 ≈ 2.8 ms), or at least around 200 ms in practice.

Something is going wrong here, and/or I'm tired. Right now I'm tired, barely following things, forgetting stuff, and can't focus on the tests. So I decided to just post it here so you can check the raw data if you want.

Info: the "Layer simulated compact cnn2d 16 channels" test with 16 input channels is effectively a 1-channel convolution, because it divides the channel count by 16 (and multiplies the batch size by 16).

Also check the "Single batch" section; there are some fast results there (about 15 times faster than cnn2d).

Based on this I guess that cnn3d is a lot more optimized than cnn2d. It might be because of how CUDA handles the matrices. In any case, 16 channels is a very small number, and tests with higher channel counts are needed to get more reliable data. Only the first tests ("Single batch") use 256 channels, but they are single batch.

I think there is a way to move data between the batch and channel dimensions to optimize how CUDA units are scheduled, so that all CUDA cores get equal-sized tasks. I got it working for the single-batch case; the single-channel tests went wrong because of a shape error.
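
A rough sketch of what I mean by moving dimensions (placeholder variable names, same Nd4j imports as the test code below): fold the channel axis into the batch axis so each (sample, channel) slice becomes its own batch entry, which is what the "simulated channelwise" cases below do via getFeatures(minibatchSize * nIn, 1L):

// [batch, channels, height, width] in NCHW order
final INDArray nchw = Nd4j.ones(16, 16, 28, 28);
// fold channels into the batch axis: each (sample, channel) slice becomes one batch entry
final INDArray folded = nchw.reshape(16 * 16, 1, 28, 28);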

Output, slightly cut because of the length limit (long, for debugging):
Starting tests
#### Single batch ####
Starting tests with minibatch size 1 nIn 256 nOut 256
03:53:11.149 [main] INFO org.nd4j.linalg.factory.Nd4jBackend - Loaded [JCublasBackend] backend
03:53:11.156 [main] ERROR org.nd4j.common.config.ND4JClassLoading - Cannot find class [org.nd4j.linalg.jblas.JblasBackend] of provided class-loader.
03:53:11.157 [main] ERROR org.nd4j.common.config.ND4JClassLoading - Cannot find class [org.canova.api.io.data.DoubleWritable] of provided class-loader.
03:53:11.161 [main] ERROR org.nd4j.common.config.ND4JClassLoading - Cannot find class [org.nd4j.linalg.jblas.JblasBackend] of provided class-loader.
03:53:11.161 [main] ERROR org.nd4j.common.config.ND4JClassLoading - Cannot find class [org.canova.api.io.data.DoubleWritable] of provided class-loader.
03:53:12.850 [main] INFO org.nd4j.nativeblas.NativeOpsHolder - Number of threads used for linear algebra: 32
03:53:12.893 [main] INFO org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner - Backend used: [CUDA]; OS: [Windows 10]
03:53:12.893 [main] INFO org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner - Cores: [8]; Memory: [8,0GB];
03:53:12.893 [main] INFO org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner - Blas vendor: [CUBLAS]
03:53:12.907 [main] INFO org.nd4j.linalg.jcublas.JCublasBackend - ND4J CUDA build version: 11.0.221
03:53:12.908 [main] INFO org.nd4j.linalg.jcublas.JCublasBackend - CUDA device 0: [NVIDIA GeForce RTX 3060]; cc: [8.6]; Total memory: [12884901888]
03:53:12.908 [main] INFO org.nd4j.linalg.jcublas.JCublasBackend - Backend build information:
MSVC: 192829914
STD version: 201402L
CUDA: 11.0.221
DEFAULT_ENGINE: samediff::ENGINE_CUDA
HAVE_FLATBUFFERS
03:53:12.932 [main] INFO org.deeplearning4j.nn.graph.ComputationGraph - Starting ComputationGraph with WorkspaceModes set to [training: ENABLED; inference: ENABLED], cacheMode set to [NONE]
03:53:14.124 [main] DEBUG org.deeplearning4j.nn.layers.HelperUtils - Attempting to initialize cudnn helper org.deeplearning4j.cuda.convolution.CudnnConvolutionHelper
03:53:14.726 [main] DEBUG org.deeplearning4j.nn.layers.HelperUtils - Cudnn helper org.deeplearning4j.cuda.convolution.CudnnConvolutionHelper successfully initialized
03:53:14.726 [main] DEBUG org.deeplearning4j.nn.layers.HelperUtils - org.deeplearning4j.cuda.convolution.CudnnConvolutionHelper successfully initialized
Testing layer normal convolution shape [1, 256, 28, 28] result millis 3792
Testing layer normal (?) 3D convolution shape [1, 256, 1, 28, 28] result millis 215
Testing layer normal separable convolution shape [1, 256, 28, 28] result millis 251
03:53:20.399 [main] DEBUG org.deeplearning4j.nn.layers.HelperUtils - Attempting to initialize cudnn helper org.deeplearning4j.cuda.convolution.CudnnConvolutionHelper
03:53:20.400 [main] DEBUG org.deeplearning4j.nn.layers.HelperUtils - Cudnn helper org.deeplearning4j.cuda.convolution.CudnnConvolutionHelper successfully initialized
03:53:20.400 [main] DEBUG org.deeplearning4j.nn.layers.HelperUtils - org.deeplearning4j.cuda.convolution.CudnnConvolutionHelper successfully initialized
Testing layer simulated compact cnn2d 16 channels shape [16, 16, 28, 28] result millis 752
Testing layer simulated compact cnn3d 16 channels shape [1, 16, 16, 28, 28] result millis 228
Testing layer normal separable simulated compact convolution shape [16, 16, 28, 28] result millis 406
Testing layer simulated channelwise (?) cnn3d 1 channel shape [1, 1, 256, 28, 28] result millis 1118
Testing layer simulated inverted channelwise (?) cnn3d 1 channel shape [1, 256, 1, 28, 28] result millis 167
03:53:23.729 [main] DEBUG org.deeplearning4j.nn.layers.HelperUtils - Attempting to initialize cudnn helper org.deeplearning4j.cuda.convolution.CudnnConvolutionHelper
03:53:23.730 [main] DEBUG org.deeplearning4j.nn.layers.HelperUtils - Cudnn helper org.deeplearning4j.cuda.convolution.CudnnConvolutionHelper successfully initialized
03:53:23.730 [main] DEBUG org.deeplearning4j.nn.layers.HelperUtils - org.deeplearning4j.cuda.convolution.CudnnConvolutionHelper successfully initialized
Testing layer simulated channelwise cnn2d 1 channel shape [256, 1, 28, 28] result millis 5557
Testing layer normal channelwise cnn2d 1 channel shape [1, 256, 28, 28] got exception Cannot do forward pass in DepthwiseConvolution2D layer (layer name = layer, layer index = 1): input array channels does not match CNN layer configuration (data format = NCHW, data input channels = 256, [minibatch,inputDepth,height,width]=[1, 256, 28, 28]; expected input channels = 1) (layer name: layer, layer index: 1, layer type: DepthwiseConvolution2DLayer)
Testing layer normal channelwise with single channel test cnn2d 1 channel shape [256, 1, 28, 28] result millis 214
#####################################
000 215 ms ___ Layer normal (?) 3D convolution shape [1, 256, 1, 28, 28]
000 214 ms ___ Layer normal channelwise with single channel test cnn2d 1 channel shape [256, 1, 28, 28]
003 792 ms ___ Layer normal convolution shape [1, 256, 28, 28]
000 251 ms ___ Layer normal separable convolution shape [1, 256, 28, 28]
000 406 ms ___ Layer normal separable simulated compact convolution shape [16, 16, 28, 28]
001 118 ms ___ Layer simulated channelwise (?) cnn3d 1 channel shape [1, 1, 256, 28, 28]
005 557 ms ___ Layer simulated channelwise cnn2d 1 channel shape [256, 1, 28, 28]
000 752 ms ___ Layer simulated compact cnn2d 16 channels shape [16, 16, 28, 28]
000 228 ms ___ Layer simulated compact cnn3d 16 channels shape [1, 16, 16, 28, 28]
000 167 ms ___ Layer simulated inverted channelwise (?) cnn3d 1 channel shape [1, 256, 1, 28, 28]
#####################################
#### Single channel ####
Starting tests with minibatch size 256 nIn 1 nOut 256
03:53:30.700 [main] DEBUG org.deeplearning4j.nn.layers.HelperUtils - Attempting to initialize cudnn helper org.deeplearning4j.cuda.convolution.CudnnConvolutionHelper
03:53:30.701 [main] DEBUG org.deeplearning4j.nn.layers.HelperUtils - Cudnn helper org.deeplearning4j.cuda.convolution.CudnnConvolutionHelper successfully initialized
03:53:30.702 [main] DEBUG org.deeplearning4j.nn.layers.HelperUtils - org.deeplearning4j.cuda.convolution.CudnnConvolutionHelper successfully initialized
Testing layer normal convolution shape [256, 1, 28, 28] result millis 5538
Testing layer normal (?) 3D convolution shape [256, 1, 1, 28, 28] result millis 4396
Testing layer normal separable convolution shape [256, 1, 28, 28]shapeInfo Mismatched shape: [2,  186624,256,  1,186624,  8192,1,102]
Shape requested: : {186624, 1}
got exception Op [sconv2d] execution failed
Skipping layer test 'simulated compact cnn2d 16 channels' channels = 0
Skipping layer test 'simulated compact cnn3d 16 channels' channels = 0
Skipping layer test 'normal separable simulated compact convolution' channels = 0
Testing layer simulated channelwise (?) cnn3d 1 channel shape [256, 1, 1, 28, 28] result millis 4392
Testing layer simulated inverted channelwise (?) cnn3d 1 channel shape [256, 1, 1, 28, 28] result millis 4387
03:53:53.378 [main] DEBUG org.deeplearning4j.nn.layers.HelperUtils - Attempting to initialize cudnn helper org.deeplearning4j.cuda.convolution.CudnnConvolutionHelper
03:53:53.379 [main] DEBUG org.deeplearning4j.nn.layers.HelperUtils - Cudnn helper org.deeplearning4j.cuda.convolution.CudnnConvolutionHelper successfully initialized
03:53:53.379 [main] DEBUG org.deeplearning4j.nn.layers.HelperUtils - org.deeplearning4j.cuda.convolution.CudnnConvolutionHelper successfully initialized
Testing layer simulated channelwise cnn2d 1 channel shape [256, 1, 28, 28] got exception Cannot do forward pass in Convolution layer (layer name = layer, layer index = 1): input array channels does not match CNN layer configuration (data format = NCHW, data input channels = 1, [minibatch, channels, height, width]=[256, 1, 28, 28]; expected input channels = 256) (layer name: layer, layer index: 1, layer type: ConvolutionLayer)
Testing layer normal channelwise cnn2d 1 channel shape [256, 1, 28, 28] got exception Cannot do forward pass in DepthwiseConvolution2D layer (layer name = layer, layer index = 1): input array channels does not match CNN layer configuration (data format = NCHW, data input channels = 1, [minibatch,inputDepth,height,width]=[256, 1, 28, 28]; expected input channels = 256) (layer name: layer, layer index: 1, layer type: DepthwiseConvolution2DLayer)
Testing layer normal channelwise with single channel test cnn2d 1 channel shape [256, 1, 28, 28] got exception Cannot do forward pass in DepthwiseConvolution2D layer (layer name = layer, layer index = 1): input array channels does not match CNN layer configuration (data format = NCHW, data input channels = 1, [minibatch,inputDepth,height,width]=[256, 1, 28, 28]; expected input channels = 256) (layer name: layer, layer index: 1, layer type: DepthwiseConvolution2DLayer)
#####################################
004 396 ms ___ Layer normal (?) 3D convolution shape [256, 1, 1, 28, 28]
005 538 ms ___ Layer normal convolution shape [256, 1, 28, 28]
004 392 ms ___ Layer simulated channelwise (?) cnn3d 1 channel shape [256, 1, 1, 28, 28]
004 387 ms ___ Layer simulated inverted channelwise (?) cnn3d 1 channel shape [256, 1, 1, 28, 28]
#####################################
#### 16 batch size 16 channels ####
<>
#####################################
000 485 ms ___ Layer normal (?) 3D convolution shape [16, 16, 1, 28, 28]
000 201 ms ___ Layer normal channelwise cnn2d 1 channel shape [16, 16, 28, 28]
000 716 ms ___ Layer normal convolution shape [16, 16, 28, 28]
000 368 ms ___ Layer normal separable convolution shape [16, 16, 28, 28]
004 295 ms ___ Layer simulated channelwise (?) cnn3d 1 channel shape [16, 1, 16, 28, 28]
005 544 ms ___ Layer simulated compact cnn2d 16 channels shape [256, 1, 28, 28]
004 266 ms ___ Layer simulated compact cnn3d 16 channels shape [16, 1, 16, 28, 28]
000 458 ms ___ Layer simulated inverted channelwise (?) cnn3d 1 channel shape [16, 16, 1, 28, 28]
#####################################
#### Additional 1 batch size 1 channels ####
Starting tests with minibatch size 1 nIn 1 nOut 256
03:54:13.039 [main] DEBUG org.deeplearning4j.nn.layers.HelperUtils - Attempting to initialize cudnn helper org.deeplearning4j.cuda.convolution.CudnnConvolutionHelper
03:54:13.041 [main] DEBUG org.deeplearning4j.nn.layers.HelperUtils - Cudnn helper org.deeplearning4j.cuda.convolution.CudnnConvolutionHelper successfully initialized
03:54:13.041 [main] DEBUG org.deeplearning4j.nn.layers.HelperUtils - org.deeplearning4j.cuda.convolution.CudnnConvolutionHelper successfully initialized
Testing layer normal convolution shape [1, 1, 28, 28] result millis 206
Testing layer normal (?) 3D convolution shape [1, 1, 1, 28, 28] result millis 94
Testing layer normal separable convolution shape [1, 1, 28, 28]shapeInfo Mismatched shape: [2,  729,256,  1,729,  8192,1,102]
Shape requested: : {729, 1}
got exception Op [sconv2d] execution failed
Skipping layer test 'simulated compact cnn2d 16 channels' channels = 0
Skipping layer test 'simulated compact cnn3d 16 channels' channels = 0
Skipping layer test 'normal separable simulated compact convolution' channels = 0
Testing layer simulated channelwise (?) cnn3d 1 channel shape [1, 1, 1, 28, 28] result millis 98
Testing layer simulated inverted channelwise (?) cnn3d 1 channel shape [1, 1, 1, 28, 28] result millis 94
03:54:13.659 [main] DEBUG org.deeplearning4j.nn.layers.HelperUtils - Attempting to initialize cudnn helper org.deeplearning4j.cuda.convolution.CudnnConvolutionHelper
03:54:13.660 [main] DEBUG org.deeplearning4j.nn.layers.HelperUtils - Cudnn helper org.deeplearning4j.cuda.convolution.CudnnConvolutionHelper successfully initialized
03:54:13.660 [main] DEBUG org.deeplearning4j.nn.layers.HelperUtils - org.deeplearning4j.cuda.convolution.CudnnConvolutionHelper successfully initialized
Testing layer simulated channelwise cnn2d 1 channel shape [1, 1, 28, 28] result millis 209
Testing layer normal channelwise cnn2d 1 channel shape [1, 1, 28, 28] result millis 95
Testing layer normal channelwise with single channel test cnn2d 1 channel shape [1, 1, 28, 28] result millis 91
#####################################
000 094 ms ___ Layer normal (?) 3D convolution shape [1, 1, 1, 28, 28]
000 095 ms ___ Layer normal channelwise cnn2d 1 channel shape [1, 1, 28, 28]
000 091 ms ___ Layer normal channelwise with single channel test cnn2d 1 channel shape [1, 1, 28, 28]
000 206 ms ___ Layer normal convolution shape [1, 1, 28, 28]
000 098 ms ___ Layer simulated channelwise (?) cnn3d 1 channel shape [1, 1, 1, 28, 28]
000 209 ms ___ Layer simulated channelwise cnn2d 1 channel shape [1, 1, 28, 28]
000 094 ms ___ Layer simulated inverted channelwise (?) cnn3d 1 channel shape [1, 1, 1, 28, 28]
#####################################
#### Additional 256 batch size 16 channels ####
Starting tests with minibatch size 256 nIn 16 nOut 256
03:54:14.143 [main] DEBUG org.deeplearning4j.nn.layers.HelperUtils - Attempting to initialize cudnn helper org.deeplearning4j.cuda.convolution.CudnnConvolutionHelper
03:54:14.144 [main] DEBUG org.deeplearning4j.nn.layers.HelperUtils - Cudnn helper org.deeplearning4j.cuda.convolution.CudnnConvolutionHelper successfully initialized
03:54:14.144 [main] DEBUG org.deeplearning4j.nn.layers.HelperUtils - org.deeplearning4j.cuda.convolution.CudnnConvolutionHelper successfully initialized
Testing layer normal convolution shape [256, 16, 28, 28] result millis 6092
Testing layer normal (?) 3D convolution shape [256, 16, 1, 28, 28] result millis 5172
Testing layer normal separable convolution shape [256, 16, 28, 28] result millis 3025
03:54:31.323 [main] DEBUG org.deeplearning4j.nn.layers.HelperUtils - Attempting to initialize cudnn helper org.deeplearning4j.cuda.convolution.CudnnConvolutionHelper
03:54:31.324 [main] DEBUG org.deeplearning4j.nn.layers.HelperUtils - Cudnn helper org.deeplearning4j.cuda.convolution.CudnnConvolutionHelper successfully initialized
03:54:31.324 [main] DEBUG org.deeplearning4j.nn.layers.HelperUtils - org.deeplearning4j.cuda.convolution.CudnnConvolutionHelper successfully initialized
Testing layer simulated compact cnn2d 16 channels shape [4096, 1, 28, 28]03:54:32.473 [main] WARN org.deeplearning4j.cuda.convolution.CudnnConvolutionHelper - Error getting CuDNN forward algorithm - falling back on IMPLICIT_GEMM
03:54:32.499 [main] WARN org.deeplearning4j.nn.layers.convolution.ConvolutionLayer - CuDNN execution failed - falling back on built-in implementation
java.lang.RuntimeException: CuDNN error = 8: CUDNN_STATUS_EXECUTION_FAILED during forward pass - step cudnnConvolutionForward: inputShape=[4096, 1, 28, 28], weightsShape=[256, 1, 2, 2], biasShape=[1, 256], kernel=[2, 2], stride=[1, 1], padding=[0, 0], dilation=[1, 1], AlgoMode=USER_SPECIFIED, fwdAlgo=IMPLICIT_GEMM, convolutionMode=Truncate
  <>
  at PerformanceTest.main(PerformanceTest.java:45)
got exception cudaMalloc failed; Bytes: [6115296256]; Error code [2]; DEVICE [0]
Testing layer simulated compact cnn3d 16 channels shape [256, 1, 16, 28, 28] got exception Cannot invoke "org.nd4j.linalg.api.memory.pointers.PagedPointer.withOffset(long, long)" because the return value of "org.nd4j.linalg.api.memory.pointers.PointersPair.getDevicePointer()" is null
<> exception Cannot invoke "org.nd4j.linalg.api.memory.pointers.PagedPointer.withOffset(long, long)" because the return value of "org.nd4j.linalg.api.memory.pointers.PointersPair.getDevicePointer()" is null
03:54:33.906 [main] DEBUG org.deeplearning4j.nn.layers.HelperUtils - Attempting to initialize cudnn helper org.deeplearning4j.cuda.convolution.CudnnConvolutionHelper
03:54:33.907 [main] DEBUG org.deeplearning4j.nn.layers.HelperUtils - Cudnn helper org.deeplearning4j.cuda.convolution.CudnnConvolutionHelper successfully initialized
03:54:33.907 [main] DEBUG org.deeplearning4j.nn.layers.HelperUtils - org.deeplearning4j.cuda.convolution.CudnnConvolutionHelper successfully initialized
Testing layer simulated channelwise cnn2d 1 channel shape [4096, 1, 28, 28] got exception Cannot do forward pass in Convolution layer (layer name = layer, layer index = 1): input array channels does not match CNN layer configuration (data format = NCHW, data input channels = 1, [minibatch, channels, height, width]=[4096, 1, 28, 28]; expected input channels = 256) (layer name: layer, layer index: 1, layer type: ConvolutionLayer)
Testing layer normal channelwise cnn2d 1 channel shape [256, 16, 28, 28] got exception Cannot do forward pass in DepthwiseConvolution2D layer (layer name = layer, layer index = 1): input array channels does not match CNN layer configuration (data format = NCHW, data input channels = 16, [minibatch,inputDepth,height,width]=[256, 16, 28, 28]; expected input channels = 256) (layer name: layer, layer index: 1, layer type: DepthwiseConvolution2DLayer)
Testing layer normal channelwise with single channel test cnn2d 1 channel shape [4096, 1, 28, 28] got exception Cannot do forward pass in DepthwiseConvolution2D layer (layer name = layer, layer index = 1): input array channels does not match CNN layer configuration (data format = NCHW, data input channels = 1, [minibatch,inputDepth,height,width]=[4096, 1, 28, 28]; expected input channels = 256) (layer name: layer, layer index: 1, layer type: DepthwiseConvolution2DLayer)
#####################################
005 172 ms ___ Layer normal (?) 3D convolution shape [256, 16, 1, 28, 28]
006 092 ms ___ Layer normal convolution shape [256, 16, 28, 28]
003 025 ms ___ Layer normal separable convolution shape [256, 16, 28, 28]
#####################################
Done!
Results
#### Single batch ####
000 215 ms ___ Layer normal (?) 3D convolution shape [1, 256, 1, 28, 28]
000 214 ms ___ Layer normal channelwise with single channel test cnn2d 1 channel shape [256, 1, 28, 28]
003 792 ms ___ Layer normal convolution shape [1, 256, 28, 28]
000 251 ms ___ Layer normal separable convolution shape [1, 256, 28, 28]
000 406 ms ___ Layer normal separable simulated compact convolution shape [16, 16, 28, 28]
001 118 ms ___ Layer simulated channelwise (?) cnn3d 1 channel shape [1, 1, 256, 28, 28]
005 557 ms ___ Layer simulated channelwise cnn2d 1 channel shape [256, 1, 28, 28]
000 752 ms ___ Layer simulated compact cnn2d 16 channels shape [16, 16, 28, 28]
000 228 ms ___ Layer simulated compact cnn3d 16 channels shape [1, 16, 16, 28, 28]
000 167 ms ___ Layer simulated inverted channelwise (?) cnn3d 1 channel shape [1, 256, 1, 28, 28]
#### Single channel ####
004 396 ms ___ Layer normal (?) 3D convolution shape [256, 1, 1, 28, 28]
005 538 ms ___ Layer normal convolution shape [256, 1, 28, 28]
004 392 ms ___ Layer simulated channelwise (?) cnn3d 1 channel shape [256, 1, 1, 28, 28]
004 387 ms ___ Layer simulated inverted channelwise (?) cnn3d 1 channel shape [256, 1, 1, 28, 28]
#### 16 batch size 16 channels ####
000 485 ms ___ Layer normal (?) 3D convolution shape [16, 16, 1, 28, 28]
000 201 ms ___ Layer normal channelwise cnn2d 1 channel shape [16, 16, 28, 28]
000 716 ms ___ Layer normal convolution shape [16, 16, 28, 28]
000 368 ms ___ Layer normal separable convolution shape [16, 16, 28, 28]
004 295 ms ___ Layer simulated channelwise (?) cnn3d 1 channel shape [16, 1, 16, 28, 28]
005 544 ms ___ Layer simulated compact cnn2d 16 channels shape [256, 1, 28, 28]
004 266 ms ___ Layer simulated compact cnn3d 16 channels shape [16, 1, 16, 28, 28]
000 458 ms ___ Layer simulated inverted channelwise (?) cnn3d 1 channel shape [16, 16, 1, 28, 28]
#### Additional 1 batch size 1 channels ####
000 094 ms ___ Layer normal (?) 3D convolution shape [1, 1, 1, 28, 28]
000 095 ms ___ Layer normal channelwise cnn2d 1 channel shape [1, 1, 28, 28]
000 091 ms ___ Layer normal channelwise with single channel test cnn2d 1 channel shape [1, 1, 28, 28]
000 206 ms ___ Layer normal convolution shape [1, 1, 28, 28]
000 098 ms ___ Layer simulated channelwise (?) cnn3d 1 channel shape [1, 1, 1, 28, 28]
000 209 ms ___ Layer simulated channelwise cnn2d 1 channel shape [1, 1, 28, 28]
000 094 ms ___ Layer simulated inverted channelwise (?) cnn3d 1 channel shape [1, 1, 1, 28, 28]
#### Additional 256 batch size 16 channels ####
005 172 ms ___ Layer normal (?) 3D convolution shape [256, 16, 1, 28, 28]
006 092 ms ___ Layer normal convolution shape [256, 16, 28, 28]
003 025 ms ___ Layer normal separable convolution shape [256, 16, 28, 28]
Done!
Source code for the tests:
import java.text.NumberFormat;
import java.util.Arrays;
import java.util.Map.Entry;
import java.util.TreeMap;

import org.deeplearning4j.nn.conf.ComputationGraphConfiguration;
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.deeplearning4j.nn.conf.inputs.InputType;
import org.deeplearning4j.nn.conf.layers.Convolution3D;
import org.deeplearning4j.nn.conf.layers.Convolution3D.DataFormat;
import org.deeplearning4j.nn.conf.layers.ConvolutionLayer;
import org.deeplearning4j.nn.conf.layers.DepthwiseConvolution2D;
import org.deeplearning4j.nn.conf.layers.GlobalPoolingLayer;
import org.deeplearning4j.nn.conf.layers.Layer;
import org.deeplearning4j.nn.conf.layers.OutputLayer;
import org.deeplearning4j.nn.conf.layers.SeparableConvolution2D;
import org.deeplearning4j.nn.graph.ComputationGraph;
import org.deeplearning4j.nn.modelimport.keras.preprocessors.ReshapePreprocessor;
import org.deeplearning4j.nn.weights.WeightInit;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;
import org.nd4j.linalg.learning.config.Nesterovs;

public class PerformanceTest
{
  private static final TreeMap<String, Long> cache = new TreeMap<>();

  public static void main(final String[] args)
  {
  	System.out.println("Starting tests");

  	System.out.println("#### Single batch ####");
  	performTests(1, 256, 256);

  	System.out.println("#### Single channel ####");
  	performTests(256, 1, 256);

  	System.out.println("#### 16 batch size 16 channels ####");
  	performTests(16, 16, 256);

  	System.out.println("#### Additional 1 batch size 1 channels ####");
  	performTests(1, 1, 256);

  	System.out.println("#### Additional 256 batch size 16 channels ####");
  	performTests(256, 16, 256);

  	System.out.println("Done!");
  }

  private static INDArray getFeatures(final long minibatchSize, final long channels)
  {
  	return Nd4j.ones(minibatchSize, channels, 28L, 28L);
  }

  private static ComputationGraph getGraph(final long nIn, final Layer layer)
  {
  	final ComputationGraphConfiguration conf = new NeuralNetConfiguration.Builder().seed(-1).l2(0.0005) // ridge regression value
  			.updater(new Nesterovs(1e-4)).weightInit(WeightInit.XAVIER).graphBuilder().addInputs("input")
  			.setInputTypes(InputType.convolutional(28, 28, nIn)).setOutputs("output").layer("layer", layer, "input")
  			.layer("ss", new GlobalPoolingLayer.Builder().build(), "layer")
  			.layer("output", new OutputLayer.Builder().nOut(2L).build(), "ss").build();

  	final ComputationGraph net = new ComputationGraph(conf);

  	net.init();
  	return net;
  }

  private static ComputationGraph getGraphCnn3D(final long channels, final long depth, final long nOut,
  		final Layer layer)
  {
  	final ComputationGraphConfiguration conf = new NeuralNetConfiguration.Builder().seed(-1).l2(0.0005) // ridge regression value
  			.updater(new Nesterovs(1e-4)).weightInit(WeightInit.XAVIER).graphBuilder().addInputs("input")
  			.setInputTypes(InputType.convolutional3D(DataFormat.NCDHW, depth, 28, 28, channels))
  			.setOutputs("output").layer("layer", layer, "input")
  			.layer("ss", new GlobalPoolingLayer.Builder().build(), "layer")
  			.inputPreProcessor("ss", new ReshapePreprocessor(new long[]
  			{ -1, nOut, depth, 27, 27 }, new long[]
  			{ -1, nOut * depth, 27, 27 }, true))
//				.inputPreProcessor("ss",
//						new ComposableInputPreProcessor(new Cnn3DToFeedForwardPreProcessor((int) depth, 28, 28),
//								new FeedForwardToCnnPreProcessor(28, 28, depth)))
  			.layer("output", new OutputLayer.Builder().nOut(2L).build(), "ss").build();

  	final ComputationGraph net = new ComputationGraph(conf);

  	net.init();
  	return net;
  }

  private static void performTests(final long minibatchSize, final long nIn, final long nOut)
  {
  	System.out.println("Starting tests with minibatch size " + minibatchSize + " nIn " + nIn + " nOut " + nOut);

  	cache.clear();

  	//normal

  	test(0, minibatchSize, nIn, nOut);

  	test(1, minibatchSize, nIn, nOut);

  	test(2, minibatchSize, nIn, nOut);

  	//compact

  	test(3, minibatchSize, nIn, nOut);

  	test(4, minibatchSize, nIn, nOut);

  	test(9, minibatchSize, nIn, nOut);

  	//depthwise

  	test(5, minibatchSize, nIn, nOut);

  	test(6, minibatchSize, nIn, nOut);

  	test(7, minibatchSize, nIn, nOut);

  	test(8, minibatchSize, nIn, nOut);

  	test(10, minibatchSize, nIn, nOut);

  	System.out.println("#####################################");

  	final NumberFormat integerInstance = NumberFormat.getIntegerInstance();
  	integerInstance.setMinimumIntegerDigits(6);
  	for (final Entry<String, Long> entry : cache.entrySet())
  	{
  		System.out.println(integerInstance.format(entry.getValue())
  				+ /*new String(new byte[ 64 - entry.getKey().length() ]) +*/" ms ___ " + entry.getKey());
  	}

  	System.out.println("#####################################");
  }

  private static void test(final ComputationGraph graph, final String layerName, final INDArray features)
  {
  	System.out.print("Testing layer " + layerName + " shape " + Arrays.toString(features.shape()));

  	try
  	{
  		for (int i = 0; i < 20; i++)
  		{
  			graph.output(features);
  		}

  		final long t0 = System.currentTimeMillis();

  		for (int i = 0; i < 100; i++)
  		{
  			graph.output(features);
  		}

  		final long td = System.currentTimeMillis() - t0;

  		cache.put("Layer " + layerName + " shape " + Arrays.toString(features.shape()), td);
  		System.out.println(" result millis " + td);
  	}
  	catch (final Exception e)
  	{
  		System.out.println(" got exception " + e.getMessage());
  	}
  }

  private static void test(final int index, final long minibatchSize, final long nIn, final long nOut)
  {
  	switch (index)
  		{
  		case 0:
  			test(getGraph(nIn, new ConvolutionLayer.Builder().kernelSize(2, 2).nOut(nOut).build()),
  					"normal convolution", getFeatures(minibatchSize, nIn));
  			break;
  		case 1:
  			test(getGraphCnn3D(nIn, 1, nOut, new Convolution3D.Builder().kernelSize(1, 2, 2).nOut(nOut).build()),
  					"normal (?) 3D convolution", Nd4j.ones(minibatchSize, nIn, 1, 28, 28));
  			break;
  		case 2:
  			test(getGraph(nIn, new SeparableConvolution2D.Builder().kernelSize(2, 2).nOut(nOut).build()),
  					"normal separable convolution", getFeatures(minibatchSize, nIn));
  			break;
  		case 3:
  			if (nIn / 16 == 0)
  			{
  				System.out.println("Skipping layer test 'simulated compact cnn2d 16 channels' channels = 0");
  				break;
  			}
  			test(getGraph(nIn / 16, new ConvolutionLayer.Builder().kernelSize(2, 2).nOut(nOut).build()),
  					"simulated compact cnn2d 16 channels", getFeatures(minibatchSize * 16, nIn / 16));
  			break;
  		case 4:
  			if (nIn / 16 == 0)
  			{
  				System.out.println("Skipping layer test 'simulated compact cnn3d 16 channels' channels = 0");
  				break;
  			}
  			test(getGraphCnn3D(nIn / 16, 16, nOut,
  					new Convolution3D.Builder().kernelSize(1, 2, 2).nOut(nOut).build()),
  					"simulated compact cnn3d 16 channels", Nd4j.ones(minibatchSize, nIn / 16, 16, 28, 28));
  			break;
  		case 5:
  			test(getGraphCnn3D(1, nIn, nOut, new Convolution3D.Builder().kernelSize(1, 2, 2).nOut(nOut).build()),
  					"simulated channelwise (?) cnn3d 1 channel", Nd4j.ones(minibatchSize, 1, nIn, 28, 28));
  			break;
  		case 6:
  			test(getGraphCnn3D(nIn, 1, nOut, new Convolution3D.Builder().kernelSize(1, 2, 2).nOut(nOut).build()),
  					"simulated inverted channelwise (?) cnn3d 1 channel", Nd4j.ones(minibatchSize, nIn, 1, 28, 28));
  			break;
  		case 7:
  			test(getGraph(minibatchSize, new ConvolutionLayer.Builder().kernelSize(2, 2).nOut(nOut).build()),
  					"simulated channelwise cnn2d 1 channel", getFeatures(minibatchSize * nIn, 1L));
  			break;
  		case 8:
  			test(getGraph(minibatchSize, new DepthwiseConvolution2D.Builder().kernelSize(2, 2).nOut(nOut).build()),
  					"normal channelwise cnn2d 1 channel", getFeatures(minibatchSize, nIn));
  			break;
  		case 9:
  			if (nIn / 16 == 0)
  			{
  				System.out.println(
  						"Skipping layer test 'normal separable simulated compact convolution' channels = 0");
  				break;
  			}
  			test(getGraph(nIn / 16, new SeparableConvolution2D.Builder().kernelSize(2, 2).nOut(nOut).build()),
  					"normal separable simulated compact convolution", getFeatures(minibatchSize * 16, nIn / 16));
  			break;
  		case 10:
  			test(getGraph(minibatchSize, new DepthwiseConvolution2D.Builder().kernelSize(2, 2).nOut(nOut).build()),
  					"normal channelwise with single channel test cnn2d 1 channel",
  					getFeatures(minibatchSize * nIn, 1L));
  			break;
  		default:
  			throw new IllegalArgumentException("No such index " + index);
  		}

  }
}