Optimization question

Update. Fixed bugs. New data:

There is still a bug in separable convolution with a single channel: `shapeInfo` reports `Mismatched shape: [2, 186624,256, 1,186624, 8192,1,102]`, `Shape requested: {186624, 1}`, and the op fails with `Op [sconv2d] execution failed`.
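For reference, a minimal sketch of the kind of setup that triggers it for me, assuming DL4J's `SeparableConvolution2D` layer (the kernel size, filter count, and output layer are placeholders, not my actual benchmark code):

```java
import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.deeplearning4j.nn.conf.inputs.InputType;
import org.deeplearning4j.nn.conf.layers.OutputLayer;
import org.deeplearning4j.nn.conf.layers.SeparableConvolution2D;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.nd4j.linalg.activations.Activation;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;
import org.nd4j.linalg.lossfunctions.LossFunctions;

public class SconvSingleChannelRepro {
    public static void main(String[] args) {
        MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
                .list()
                .layer(new SeparableConvolution2D.Builder(3, 3)
                        .nIn(1)      // single input channel: the problematic case
                        .nOut(8)     // placeholder filter count
                        .build())
                .layer(new OutputLayer.Builder(LossFunctions.LossFunction.MCXENT)
                        .activation(Activation.SOFTMAX)
                        .nOut(10)
                        .build())
                .setInputType(InputType.convolutional(28, 28, 1))
                .build();

        MultiLayerNetwork net = new MultiLayerNetwork(conf);
        net.init();

        // Batch 256, 1 channel, 28x28, as in the benchmarks below
        INDArray input = Nd4j.rand(new int[]{256, 1, 28, 28});
        net.output(input);   // fails here with "Op [sconv2d] execution failed"
    }
}
```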

But here are the new results: single channel, batch size 256.

#### Single channel ####
| Time (ms) | Layer | Input shape |
|----------:|:------|:------------|
| 4 382 | normal (?) 3D convolution | [256, 1, 1, 28, 28] |
| 195 | normal channelwise cnn2d, 1 channel | [256, 1, 28, 28] |
| 183 | normal channelwise with single-channel test, cnn2d, 1 channel | [256, 1, 28, 28] |
| 5 585 | normal convolution | [256, 1, 28, 28] |
| 4 476 | simulated channelwise (?) cnn3d, 1 channel | [256, 1, 1, 28, 28] |
| 5 509 | simulated channelwise cnn2d, 1 channel | [256, 1, 28, 28] |
| 4 448 | simulated inverted channelwise (?) cnn3d, 1 channel | [256, 1, 1, 28, 28] |

The normal cnn2d took 5.6 s, while the variants optimized for this case took 0.2 s, about a 30x difference. LeNetMNISTReLu.java and many other convolutional networks have a first layer with 1 or 3 input channels. Running that layer through cnn3d may utilize only 1/30 of the possible speed, or even less.
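Note that for a 1-channel first layer the swap is lossless: with a single input channel, a depthwise convolution with depth multiplier M computes exactly the same outputs as a regular convolution with M filters, because every filter sees only that one channel. A minimal sketch of the swap with DL4J's layer builders (the 5x5 kernel and 20 filters are placeholders in the spirit of a LeNet first layer, not values from my benchmark):

```java
import org.deeplearning4j.nn.conf.layers.ConvolutionLayer;
import org.deeplearning4j.nn.conf.layers.DepthwiseConvolution2D;

// Standard first layer, nIn = 1: hits the slow cnn2d path measured above
ConvolutionLayer slow = new ConvolutionLayer.Builder(5, 5)
        .nIn(1)
        .nOut(20)
        .build();

// Same math, different code path: with one input channel, depth multiplier 20
// yields the same 20 output maps but runs through the channelwise kernel
DepthwiseConvolution2D fast = new DepthwiseConvolution2D.Builder(5, 5)
        .nIn(1)
        .depthMultiplier(20)
        .build();
```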

Conclusion: it is possible to run low-batch (real-time) or single-channel (first CNN layer) workloads at the optimized level. Calculations can be made up to 30 times faster, and probably more with tuning. I can't believe that the first layer of the networks I built has to run 30 times slower than the optimized one. I might be wrong if I'm missing a matrix multiplication algorithm, but the algo mode is set to PREFER_FASTEST. DepthwiseConvolution2D is good for a single channel; ConvolutionLayer3D is good for a single batch. For a balanced batch size and channel count, separable convolution and cnn3d should do well. I don't know why, but cnn2d doesn't offer this level of performance, and cnn3d is better even for cnn2d-style calculations.
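Just to pin down the heuristic, here it is as code. This is a hypothetical helper, not anything in DL4J, and the hard-coded 3x3 kernels are placeholders (also, the 3D layer class in the DL4J API is, as far as I know, `Convolution3D`, not `ConvolutionLayer3D`):

```java
import org.deeplearning4j.nn.conf.layers.Convolution3D;
import org.deeplearning4j.nn.conf.layers.DepthwiseConvolution2D;
import org.deeplearning4j.nn.conf.layers.Layer;
import org.deeplearning4j.nn.conf.layers.SeparableConvolution2D;

public class ConvLayerPicker {
    /** Hypothetical: picks the conv variant that was fastest in the timings above. */
    static Layer pickConvLayer(int channels, int batchSize, int nOut) {
        if (channels == 1) {
            // Single channel (typical first layer): depthwise was ~30x faster
            return new DepthwiseConvolution2D.Builder(3, 3)
                    .nIn(1).depthMultiplier(nOut).build();
        }
        if (batchSize == 1) {
            // Single batch: the cnn3d path did best; input must be reshaped
            // to [batch, channels, 1, h, w] as in the cnn3d simulations above
            return new Convolution3D.Builder()
                    .kernelSize(3, 3, 3).nIn(channels).nOut(nOut).build();
        }
        // Balanced batch size and channel count: separable convolution
        return new SeparableConvolution2D.Builder(3, 3)
                .nIn(channels).nOut(nOut).build();
    }
}
```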

Please check whether I am wrong here. I'm tired and can't focus much; the bug fixing drained a lot of my focus.