Transformer Encoder with Self-Attention Layers

I decided to follow your advice, @treo, and implemented my own SameDiffLayer that essentially represents a single transformer encoder block (attention layer + dense layer + normalization). Unfortunately, I got stuck on two issues:

  1. To feed the attention layer's output forward into the dense layer (in my case I wanted to use sameDiff.nn.reluLayer()), I need to reshape the RNN format into the feed-forward format. Using a RnnToFeedForwardPreProcessor directly is not an option, because it works with INDArrays, not SDVariables. Direct reshaping didn't work either: inside the public SDVariable defineLayer() method, where my new logic resides, I can't get access to the batch size of layerInput (which makes sense, since it isn't known yet at graph-definition time), and I couldn't find a workaround (e.g. some lazy retrieval of the batch size from layerInput). See the first sketch after this list.
  2. At some point I'm going to need the second, optional output of MultiHeadDotProductAttention: the attention weights. I tried to retrieve them as an INDArray inside my SameDiffLayer implementation, but because that variable is of type ARRAY, I can't fetch it directly from outside. I thought of using SameDiffLayer.getLayerParams(), but I'm not sure it would give me exactly the weights I need (I want the information structure described on page 13 of the original paper: https://arxiv.org/pdf/1706.03762.pdf). See the second sketch after this list.
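
For issue 1, this is roughly the shape juggling I have in mind inside defineLayer() — would a reshape with the -1 wildcard be the intended way to keep the batch size dynamic? Just a sketch: nIn comes from my layer config, and the "W"/"b" parameter names are placeholders.

```java
// RNN format [minibatch, nIn, timeSteps] -> [minibatch, timeSteps, nIn]
SDVariable permuted = layerInput.permute(0, 2, 1);
// keep the permuted shape so the RNN layout can be restored afterwards
SDVariable rnnShape = permuted.shape();

// collapse batch and time into one dimension: [minibatch * timeSteps, nIn];
// the -1 wildcard lets SameDiff infer minibatch * timeSteps at execution time
SDVariable ffInput = permuted.reshape(-1, nIn);

// feed-forward part of the encoder block ("W" / "b" are placeholder param names)
SDVariable w = paramTable.get("W");
SDVariable b = paramTable.get("b");
SDVariable ffOutput = sameDiff.nn.reluLayer(ffInput, w, b);

// back to [minibatch, timeSteps, nOut], then to RNN format [minibatch, nOut, timeSteps]
// (this assumes nOut == nIn, otherwise rnnShape would need adjusting)
SDVariable rnnOutput = sameDiff.reshape(ffOutput, rnnShape).permute(0, 2, 1);
```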

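For issue 2, the only alternative I can think of is to drop the DL4J layer wrapper entirely and build the encoder as a standalone SameDiff graph, so that any named variable can be requested at inference time. Is something like this the recommended way? Again just a sketch: the sizes, the "attnWeights"/"encoderOut" variable names, and the dummy input batch are all placeholders.

```java
import java.util.Collections;
import java.util.Map;

import org.nd4j.autodiff.samediff.SDVariable;
import org.nd4j.autodiff.samediff.SameDiff;
import org.nd4j.linalg.api.buffer.DataType;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

int nIn = 64;                              // placeholder embedding size

SameDiff sd = SameDiff.create();
// RNN-style input [minibatch, nIn, timeSteps]; minibatch and timeSteps stay dynamic (-1)
SDVariable input = sd.placeHolder("input", DataType.FLOAT, -1, nIn, -1);

// ... build the encoder block here, giving the attention-weights variable an explicit
// name, e.g. "attnWeights", and the block output the name "encoderOut" (both hypothetical)

INDArray features = Nd4j.rand(DataType.FLOAT, 2, nIn, 10);   // stand-in for a real batch
Map<String, INDArray> placeholders = Collections.singletonMap("input", features);

// request both the block output and the attention weights in one call
Map<String, INDArray> results = sd.output(placeholders, "encoderOut", "attnWeights");
INDArray attentionWeights = results.get("attnWeights");      // -> linguistic analysis
```
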
I can work around the first issue by using the existing SelfAttentionLayer and adding the DenseLayer and then the normalization after it in the model's layer ListBuilder (most probably less efficient than having my own SameDiffLayer combining it all at once; a sketch of that stacking is below), but that still won't help me resolve issue #2 (I need those attention weights for further linguistic analysis).
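
For completeness, this is the kind of stacking I mean — a sketch only, with placeholder sizes and head count, and with the normalization and output layer omitted:

```java
import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.deeplearning4j.nn.conf.inputs.InputType;
import org.deeplearning4j.nn.conf.layers.DenseLayer;
import org.deeplearning4j.nn.conf.layers.SelfAttentionLayer;
import org.deeplearning4j.nn.conf.preprocessor.RnnToFeedForwardPreProcessor;
import org.nd4j.linalg.activations.Activation;

int dModel = 64;   // placeholder size, not my real config
int nHeads = 4;    // placeholder head count

MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
        .list()
        .layer(new SelfAttentionLayer.Builder()
                .nOut(dModel).nHeads(nHeads).projectInput(true)
                .build())
        // DenseLayer expects 2D input, hence the RNN -> FF preprocessor before layer 1
        .inputPreProcessor(1, new RnnToFeedForwardPreProcessor())
        .layer(new DenseLayer.Builder()
                .nOut(dModel).activation(Activation.RELU)
                .build())
        // ... normalization and the output layer would follow here
        .setInputType(InputType.recurrent(dModel))
        .build();
```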