Optimization pack

jijiji · May 15, 2021, 10:35am

Hi! I have experience debugging and optimizing apps, down to low-level optimization, and can find and optimize code to make it faster and more stable. Here is a list of suggested fixes, improvements, optimizations and new features. Some of them may already be partially implemented. Some are hard for me to implement myself, so treat those as suggestions.

Here is the list:

CUDA integration optimization. The .cu files do not seem to be optimized. It might be hard to spread a layer’s computation across all CUDA units. Possible causes:
Explanation
1. Wrong task separation. For example, low-resolution CNNs with a small minibatch size may not produce enough parallel work.
2. Unoptimized calls in .cu files. Too many function invocations; functions marked “inline” might work better. Too many arguments; passing arguments through may cost time. I have checked this in Java, and it might be slowing things down here too. Passing a single pointer argument may give better performance. (unchecked)
3. Convolution matrix multiplication may not be optimized (org.deeplearning4j.nn.conf.layers.ConvolutionLayer.FwdAlgo, AlgoMode.PREFER_FASTEST and such). It may not even work correctly. Each convolution may behave differently with a different algorithm. The fastest algorithm may not be chosen correctly, and it might be different for each convolution layer.
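
To illustrate point 3, here is a minimal sketch of choosing the algorithm separately for each convolution layer by measuring every candidate. It is not DL4J or cuDNN code: `pick_fastest`, `toy_cost`, the algorithm labels "direct" and "transform", and the layer shapes are all hypothetical, and the cost model only stands in for a real benchmark run on the device.

```python
def pick_fastest(layers, algos, measure):
    """For each layer, return the algorithm with the lowest measured cost."""
    choice = {}
    for layer in layers:
        costs = {algo: measure(layer, algo) for algo in algos}
        choice[layer] = min(costs, key=costs.get)
    return choice


def toy_cost(layer, algo):
    """Hypothetical cost model standing in for a real on-device benchmark."""
    kernel, size = layer
    if algo == "direct":
        return kernel * kernel * size * size  # grows with the kernel area
    return 20 * size * size                   # flat per-pixel cost


# (kernel size, feature map size) pairs; made-up shapes for illustration
layers = [(3, 32), (7, 32)]
best = pick_fastest(layers, ["direct", "transform"], toy_cost)
print(best)
```

With this toy cost model the small kernel ends up on "direct" and the large kernel on "transform", which is the kind of per-layer difference a single global setting would miss.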

Backend speed optimization. Some array operations are faster on the CPU and very slow on the GPU. It would be faster to run each operation on the platform where it performs best. Simply add a tag to each op saying where it should be computed by default.
Backend memory optimization. Especially for the CUDA backend. Both the HOST and DEVICE allocation modes leave arrays allocated on the device.

Explanation

The computation graph currently stores all data on the GPU. If the GPU doesn’t have enough memory, an exception is thrown. Almost any CNN (and possibly other layer types) can be computed on a GPU with very little memory, just by separating the calculation per layer or per segment. I propose 4 memory allocation modes: full allocation (fastest), separation per layer (average), separation per segment (slower), and auto. With separation per segment, an image with a resolution of 10 000 x 10 000 x 3 pixels can be split into many smaller pictures that are processed separately. For example, a 3x3 convolution over a 10x10 input may be split into four 6x6 convolutions (the tiles overlap by 2 pixels so that the results at the seams are complete). The split depends on the kernel config: the smaller the kernel, the higher the efficiency of this method. It would allow very high batch sizes. Separation per layer would limit the batch size by the requirements of a single layer. It is much faster than segment separation, but won’t reduce memory usage as much. Using these methods would allow computation on very big data sets and neural networks.
This could be implemented by adding a new universal backend or by changing the CUDA backend (the latter is not recommended).
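
The segment separation described above can be sketched in plain Python. This is only an illustration under stated assumptions, not backend code: the function names are hypothetical, the convolution is single-channel, stride 1, with no padding, and "moving a tile to the device" is just a list slice here. Each input tile carries a halo of kernel size minus 1 rows and columns, so the stitched result matches the unsplit convolution.

```python
def conv2d_valid(img, ker):
    """Plain 2D convolution (no padding, stride 1) on nested lists."""
    kh, kw = len(ker), len(ker[0])
    oh, ow = len(img) - kh + 1, len(img[0]) - kw + 1
    return [[sum(img[r + i][c + j] * ker[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(ow)]
            for r in range(oh)]


def conv2d_tiled(img, ker, tile_out):
    """Same result as conv2d_valid, computed one output tile at a time."""
    kh, kw = len(ker), len(ker[0])
    oh, ow = len(img) - kh + 1, len(img[0]) - kw + 1
    out = [[0] * ow for _ in range(oh)]
    for r0 in range(0, oh, tile_out):
        for c0 in range(0, ow, tile_out):
            th = min(tile_out, oh - r0)
            tw = min(tile_out, ow - c0)
            # Input tile = output tile plus a halo of (kernel - 1) pixels.
            tile = [row[c0:c0 + tw + kw - 1]
                    for row in img[r0:r0 + th + kh - 1]]
            part = conv2d_valid(tile, ker)
            for i in range(th):
                out[r0 + i][c0:c0 + tw] = part[i]
    return out


img = [[r * 10 + c for c in range(10)] for r in range(10)]  # 10x10 input
ker = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]                  # 3x3 kernel
full = conv2d_valid(img, ker)       # 8x8 output computed in one piece
tiled = conv2d_tiled(img, ker, 4)   # four 4x4 outputs from 6x6 input tiles
print(tiled == full)
```

Only one input tile and its output need to be resident at a time, which is where the memory saving would come from. The halo is recomputed for every tile, so smaller kernels waste less work.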

Data type optimization. RFLOAT16, RFLOAT8, URFLOAT8, RFLOAT4 and such. Currently the FLOAT16, FLOAT32 and FLOAT64 data types are used for calculation, but there is a way to calculate faster.
Explanation
It is possible to do calculations faster using new data types. Let’s call them relative floats (RFLOAT[size]). They are based on UINT and don’t need exponent bits. They may contain values in [0;1] or [-1;1]. The human brain works on signals which can be decoded to values from zero to one, and this is the most commonly used interval for neural networks. Weights may have very different values, but activation values outside [-1;1] are not used often. When they do occur in activations, it is more likely because of a wrong/custom architecture or data set. The data types are:
1. RFLOAT16 and URFLOAT16. Use 2 bytes of memory, with 65536 values including zero. Standard FLOAT16 has a 10-bit mantissa, so its step size grows with the exponent: just below 1 the step is 1/2048, and most of its precision is concentrated near zero. URFLOAT16 offers 65536 evenly spaced values over [0;1], so near the top of the range its step is about 32 times finer than FLOAT16.
2. RFLOAT8, URFLOAT8, URFLOAT4, URFLOAT2. Use 1 byte of memory or less and can contain from 4 to 256 values. This is enough for special neural network architectures. Multiple values can be written into a single byte, reducing memory consumption.
3. URFLOAT1, or boolean. Uses 1 bit and can contain 2 values. Although it can represent only 2 values, it can be used to create impulse neural networks (like the networks of the human brain). Impulse neural networks can have lower memory consumption and very high speed. The boolean represents whether a neuron is activated or not.
  =============
  In my opinion URFLOAT16 and URFLOAT8 are the most relevant now. The neural network config would need to support different data types for weights and activations, and also a different data type for each layer’s activations.
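
One possible reading of the unsigned types above is an unsigned fixed-point encoding of [0;1]. The sketch below assumes that reading. The function names are hypothetical, and the nibble packing is just one way to store two 4-bit values per byte, as mentioned in item 2.

```python
def encode_unorm(x, bits):
    """Map x in [0, 1] to an unsigned integer code with `bits` bits."""
    levels = (1 << bits) - 1
    x = min(1.0, max(0.0, x))  # clamp values outside the interval
    return int(round(x * levels))


def decode_unorm(code, bits):
    """Map an integer code back to a float in [0, 1]."""
    return code / ((1 << bits) - 1)


def pack_nibbles(codes):
    """Pack 4-bit codes two per byte, low nibble first."""
    codes = list(codes)
    if len(codes) % 2:
        codes.append(0)  # pad to an even count
    return bytes(codes[i] | (codes[i + 1] << 4)
                 for i in range(0, len(codes), 2))


def unpack_nibbles(data, count):
    """Inverse of pack_nibbles; `count` drops the padding value."""
    out = []
    for byte in data:
        out.append(byte & 0x0F)
        out.append(byte >> 4)
    return out[:count]


code8 = encode_unorm(0.25, 8)        # 8-bit code for an activation of 0.25
approx = decode_unorm(code8, 8)      # value recovered from the code
packed = pack_nibbles([1, 15, 7])    # three 4-bit codes stored in two bytes
print(code8, approx, list(packed), unpack_nibbles(packed, 3))
```

The round-trip error is at most half a step, that is 0.5 / 255 for 8 bits and 0.5 / 65535 for 16 bits, and it is the same everywhere in the interval.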

Auto-adjusting neural networks, universal layers, parameterizing layer size and type. As in the real human brain, almost everything in a neural network can be parameterized, even layer type and size. A neural network can adjust its own architecture to get optimal speed, memory usage and precision. Layers could be allowed to add/remove neurons, add/remove new layers and even change layer type (for example from a convolution layer to a dense layer, or from a non-recurrent layer to a recurrent layer). Here is a practical way to implement it. Start simply with adding neurons, for example by adding a parameterized channel size for convolution, and limit its min and max values. When memory consumption is too high, the size is reduced. When precision is too low, it is increased. So some balance will be found. As training progresses, it may be possible to reach good precision using fewer neurons, although starting with that number of neurons may make training very hard.
Neural state, switchable parameters. Summary: not only layers can be changed. The neural network could also be allowed to switch layer parameters depending on the situation.
Repeatable convolution. Imagine a 56x56 picture which can contain either a 28x28 image of a number in its corner or one big number in the middle. It would not be effective to train different convolution layers for each resolution. Instead, the original image can be processed together with an additional image, which is the original reduced to a resolution of 28x28. Pictures with a resolution of 1000x1000 can be processed by a convolution layer made for 10x10 pictures by processing the original image with stride and dilation, reducing its resolution by a reduction rate. This results in faster performance, lower memory consumption and easier neural architecture development (no need to write subsample and merge vertices).
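
The reduced view used by repeatable convolution can be sketched as plain strided subsampling. This is a minimal illustration with hypothetical function names, assuming a square image whose side is a multiple of the target size. It only builds the reduced input; feeding both views to the same convolution layer is left out.

```python
def subsample(img, rate):
    """Keep every `rate`-th pixel in both directions (stride, no filtering)."""
    return [row[::rate] for row in img[::rate]]


def reduced_view(img, target):
    """Reduce a square image to target x target pixels using a stride."""
    side = len(img)
    if side % target:
        raise ValueError("image side must be a multiple of the target size")
    return subsample(img, side // target)


img = [[r * 56 + c for c in range(56)] for r in range(56)]  # 56x56 picture
small = reduced_view(img, 28)   # 28x28 view for a layer trained on 28x28 input
corner = [row[:28] for row in img[:28]]  # full-resolution 28x28 corner crop
print(len(small), len(small[0]), len(corner), len(corner[0]))
```

Both `small` and `corner` have the shape the 28x28 layer expects, so the same weights can see the big number and the small one. Plain striding drops pixels without smoothing, so fine detail may alias; averaging each block instead would be a reasonable variation.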

This is not the full list.
