A simple visual example of a Fast Transform Neural Network:
It’s really not that complicated and does pretty much the same things a conventional artificial neural network does (IE. switching dot products with ReLU like functions.)
A simple visual example of a Fast Transform Neural Network:
This does look interesting, but I wonder, does this lend itself to being implemented using SIMD instructions? It appears that you’d have to build up the calculation iteratively.
The fast Walsh Hadamard transform (WHT) requires nlog_2(n) add subtract operations and that algorithm is quite a short cut, because it is the equivalent of a massive n*n matrix operation. The existence of such fast algorithm is one reason to be interested, the other is the reduction in the number of parameters required compared to a conventional artificial neural network layer.
If you follow the links and end up at the github repository, you can see I have done SSE2/3 SIMD code for both the Walsh Hadamard Transform and the Fast Transform (aka Fixed Filter Bank) neural networks.
There is no GPU code because I don’t have a modern one.
I get about 5000 65536-point WHTs per second per core on a cheap laptop.
Of course you always want more!
A question is whether it is as expressive as a conventional neural network layer, I would say yes on an expressiveness per parameter basis.
Okay, I get it. You have seen a WHT matrix being built up iteratively from a 2 by 2 basis matrix. No, that is not how the fast WHT algorithm is implemented.
However the matrix viewpoint is very useful to understanding the dot product nature of the WHT and its simple mathematical properties.
Anyway, a conventional artificial neural networks has n weights per nonlinear function, where n is the width of the network. Dilution with width. If parameters were really cheap and activation functions where really expensive that might be justified. However I don’t think that is the case. You likely can do better fitting with a more constrained number of weights/parameters per activation function.
One argument is:
ReLU is a switch. On: f(x)=x, Off f(x)=0. If you think about switching an audio source on an audio amplifier. Yeh, sure the switch is ‘on’ ‘off’, but when ‘on’ it allows a complicated signal through.
Then a conventional artificial neural network is a switched composition of dot products (weighted sums.)
The dot product of a number of dot products is still a dot product.
For a particular input vector the state of all switches is known. And the system can be condensed down to an n by n matrix operating on the input vector. However the computational cost is m.n.n where m is the number of layers. So you paid an even bigger price than multiplying the input vector with a large square n by n matrix.
With a Fast Transform net the cost would be around m.n.log_2(n) and you would need far fewer parameters, but get better fitting per parameter.
That isn’t quite true, because for every layer, it depends on the outcome of
Wx+b, so you can’t just simplify it to a single matrix multiplication.
Well, anyway it all seems very expensive computationally.
With a conventional neural network there are n weights per nonlinear function each weight leading to a different dimension. Unless there are substantial linear correlations across those dimensions then it would seem to be a waste of parameters. It would be better to have each weight lead back to its own individual nonlinear function. And there are various ways of arranging that with random projections for example.
That would also explain why pruning is so effective for conventional neural networks since you are just removing false correlations and leaving true ones.
It is a very interesting topic.
Also the range, diversity and applications of random projections are very little known. However it is better than the situation in the 1980s where they were almost completely unknown and no one could even conceive of any applications.
Time trundles on with neural network research.
I suppose there are 3 criticisms you can level at the Fast Transform neural network.
1/ It is full of linear correlations across dimensions too.
2/ Each input term on its own tends to cause a split just depending on its sign.
3/ If you want more parameters per layer you have to make the network wide.
Yes that is true, however what is the cost of getting rid of correlations? A (random) projection followed by non-linear functions. A Fast Transform layer can be a random projection (linear or nonlinear) and is around the minimum cost for that and contains nonlinear functions. It can also be a more structured projection should that be interesting. It is kind of down at the baseline cost for getting rid of linear correlations between dimensions, even if a single layer does include some. Certainly 2 layers could break up correlations strongly.
A random projection before a Fast Transform layer will get rid of that simple sign based behavior. You only need to do that before the first layer in a Fast Transform neural network.
That’s true. And really you would like the net to be twice as wide as the vectors you are processing. That would allow ResNet like behavior. With current hardware there is a problem about making a Fast Transform net very wide, namely the limited size of the L1,L2,L3 caches. Making the net wider than even the L1 cache will result in a serious slowdown. Still most hardware should be able to deal with quite wide nets.
Some final observations.
Each switch has its (on/off) state determined by all the inputs. That allows (forces) boosting and collective decision making. The behavior of presumably weaker learners in the prior layer by being collected together become stronger learners. If you try to use some simpler transform that doesn’t force collective behavior so much, that seems to result in a less good system.
You can initialize the Fast Transform neural network in such a way that information passes through unchanged. That seems to help create very deep nets as complexity can build up gradually. Which seems more manageable for training algorithms. It also offers a way to grow nets by adding layers without disturbing what has previously been learned.
Both conventional and fast transform nets construct the wanted output and do decision making at the same time with switched dot products. That is an entanglement of concerns. I don’t know if that is fully justified or not.
If you looked at the code I was exclusively using evolution algorithms to train the Fast Transform Neural Networks.
That’s not what is currently done. And truly I have very little experience with back-propagation.
However I felt obligated to try it. It worked first time and quite well.
Definitely you can train Fast Transform Neural Networks with back-propagation.
Doubtless a lot of refinement is necessary.