Transformer Encoder with Self-Attention Layers

I decided to follow your advice, @treo, and implemented my own SameDiffLayer that essentially represents a single transformer encoder block (attention layer + dense layer + normalization). Unfortunately, I got stuck on two issues:

  1. To feed the attention layer's output forward into the dense layer (in my case I wanted to use sameDiff.nn.reluLayer()), I need to reshape the RNN format into the feed-forward format. Using a RnnToFeedForwardPreProcessor directly is not an option, because it works with INDArrays, not SDVariables. Direct reshaping didn't work either: inside the public SDVariable defineLayer() method, where my new logic resides, I can't get access to the batch size of layerInput (which makes sense, since it isn't known yet at graph-definition time), and I couldn't find a workaround (e.g. some lazy retrieval of the batch size from layerInput). See the first sketch after this list.
  2. At some point I'm going to need the second, optional output of MultiHeadDotProductAttention: the attention weights. I tried to retrieve them as an INDArray inside my SameDiffLayer implementation, but because that variable is of type ARRAY, I can't fetch it directly from outside. I thought of using SameDiffLayer.getLayerParams(), but I'm not sure it would give me exactly the weights I need (I want the information structure described on page 13 of the original paper: https://arxiv.org/pdf/1706.03762.pdf). See the second sketch after this list.
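
For issue 1, this is roughly the shape juggling I have in mind inside defineLayer() — would a reshape with the -1 wildcard be the intended way to keep the batch size dynamic? Just a sketch: nIn comes from my layer config, and the "W"/"b" parameter names are placeholders.

```java
// RNN format [minibatch, nIn, timeSteps] -> [minibatch, timeSteps, nIn]
SDVariable permuted = layerInput.permute(0, 2, 1);
// keep the permuted shape so the RNN layout can be restored afterwards
SDVariable rnnShape = permuted.shape();

// collapse batch and time into one dimension: [minibatch * timeSteps, nIn];
// the -1 wildcard lets SameDiff infer minibatch * timeSteps at execution time
SDVariable ffInput = permuted.reshape(-1, nIn);

// feed-forward part of the encoder block ("W" / "b" are placeholder param names)
SDVariable w = paramTable.get("W");
SDVariable b = paramTable.get("b");
SDVariable ffOutput = sameDiff.nn.reluLayer(ffInput, w, b);

// back to [minibatch, timeSteps, nOut], then to RNN format [minibatch, nOut, timeSteps]
// (this assumes nOut == nIn, otherwise rnnShape would need adjusting)
SDVariable rnnOutput = sameDiff.reshape(ffOutput, rnnShape).permute(0, 2, 1);
```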

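For issue 2, the only alternative I can think of is to drop the DL4J layer wrapper entirely and build the encoder as a standalone SameDiff graph, so that any named variable can be requested at inference time. Is something like this the recommended way? Again just a sketch: the sizes, the "attnWeights"/"encoderOut" variable names, and the dummy input batch are all placeholders.

```java
import java.util.Collections;
import java.util.Map;

import org.nd4j.autodiff.samediff.SDVariable;
import org.nd4j.autodiff.samediff.SameDiff;
import org.nd4j.linalg.api.buffer.DataType;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

int nIn = 64;                              // placeholder embedding size

SameDiff sd = SameDiff.create();
// RNN-style input [minibatch, nIn, timeSteps]; minibatch and timeSteps stay dynamic (-1)
SDVariable input = sd.placeHolder("input", DataType.FLOAT, -1, nIn, -1);

// ... build the encoder block here, giving the attention-weights variable an explicit
// name, e.g. "attnWeights", and the block output the name "encoderOut" (both hypothetical)

INDArray features = Nd4j.rand(DataType.FLOAT, 2, nIn, 10);   // stand-in for a real batch
Map<String, INDArray> placeholders = Collections.singletonMap("input", features);

// request both the block output and the attention weights in one call
Map<String, INDArray> results = sd.output(placeholders, "encoderOut", "attnWeights");
INDArray attentionWeights = results.get("attnWeights");      // -> linguistic analysis
```
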
I can work around the first issue by using the existing SelfAttentionLayer and adding the DenseLayer and then the normalization after it in the model's layer ListBuilder (most probably less efficient than having my own SameDiffLayer combining it all at once; a sketch of that stacking is below), but that still won't help me resolve issue #2 (I need those attention weights for further linguistic analysis).
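
For completeness, this is the kind of stacking I mean — a sketch only, with placeholder sizes and head count, and with the normalization and output layer omitted:

```java
import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.deeplearning4j.nn.conf.inputs.InputType;
import org.deeplearning4j.nn.conf.layers.DenseLayer;
import org.deeplearning4j.nn.conf.layers.SelfAttentionLayer;
import org.deeplearning4j.nn.conf.preprocessor.RnnToFeedForwardPreProcessor;
import org.nd4j.linalg.activations.Activation;

int dModel = 64;   // placeholder size, not my real config
int nHeads = 4;    // placeholder head count

MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
        .list()
        .layer(new SelfAttentionLayer.Builder()
                .nOut(dModel).nHeads(nHeads).projectInput(true)
                .build())
        // DenseLayer expects 2D input, hence the RNN -> FF preprocessor before layer 1
        .inputPreProcessor(1, new RnnToFeedForwardPreProcessor())
        .layer(new DenseLayer.Builder()
                .nOut(dModel).activation(Activation.RELU)
                .build())
        // ... normalization and the output layer would follow here
        .setInputType(InputType.recurrent(dModel))
        .build();
```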