I decided to stick to your advice @treo and implemented my own SameDiffLayer that basically represents a single transformer encoder block (attention layer + dense layer + normalization). Unfortunately I got stuck on two issues:
- In order to feed the attention layer's output into the dense layer (in my case I wanted to use sameDiff.nn.reluLayer()) I need to reshape from the RNN activation format to the feed-forward format. Using a RnnToFeedForwardPreProcessor directly is not an option, because it works with INDArrays, not SDVariables. Direct reshaping didn't work either, because inside the public SDVariable defineLayer() method, where my new logic lives, there is no way to get at the batchSize of the SDVariable layerInput (which makes sense, since it isn't there yet at graph-definition time), and I couldn't find a workaround (e.g. some lazy retrieval of batchSize from layerInput); see the reshape sketch right after this list.
- At some point I'm going to need the second, optional output of MultiHeadDotProductAttention: the attention weights. I tried to retrieve them as an INDArray inside my SameDiffLayer implementation, but since that variable is of type ARRAY I can't fetch it directly from outside. I thought of using SameDiffLayer.getLayerParams(), but I'm not sure it would give me exactly the weights I need (I want the structure described on page 13 of the original paper: https://arxiv.org/pdf/1706.03762.pdf); see the retrieval sketch at the end of this post.
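
To make the first issue concrete, this is roughly what I was hoping defineLayer() could look like. It's only a sketch: it assumes reshape accepts -1 as a runtime-inferred dimension, that reshape can also take a dynamic shape variable, and that the dense step keeps the model dimension (nOut == nIn, as in the transformer encoder); nIn and the "W"/"b" param keys are placeholders from my own config.

```java
import java.util.Map;

import org.deeplearning4j.nn.conf.layers.samediff.SameDiffLayer;
import org.nd4j.autodiff.samediff.SDVariable;
import org.nd4j.autodiff.samediff.SameDiff;

// Declared abstract here only to keep the sketch short; defineParameters(),
// initializeParameters() and getOutputType() are omitted.
public abstract class TransformerEncoderLayer extends SameDiffLayer {

    private final long nIn = 512;   // model dimension from my config (placeholder value)

    @Override
    public SDVariable defineLayer(SameDiff sd, SDVariable layerInput,
                                  Map<String, SDVariable> paramTable, SDVariable mask) {
        // RNN activations arrive as [miniBatch, nIn, timeSteps]
        SDVariable nwc  = layerInput.permute(0, 2, 1);   // [miniBatch, timeSteps, nIn]
        SDVariable flat = nwc.reshape(-1, nIn);          // [miniBatch * timeSteps, nIn]; -1 inferred at runtime
        SDVariable ffn  = sd.nn.reluLayer(flat, paramTable.get("W"), paramTable.get("b"));
        // With nOut == nIn, the dynamic shape of nwc restores the time-series layout
        SDVariable back = sd.reshape(ffn, nwc.shape());  // [miniBatch, timeSteps, nIn]
        return back.permute(0, 2, 1);                    // [miniBatch, nIn, timeSteps]
    }
}
```

If there is a cleaner way to get at the effective miniBatch * timeSteps at definition time, I'd happily use that instead.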
I can work around the first issue by using the existing SelfAttentionLayer and adding the DenseLayer and then normalization after it in the model's ListBuilder (most probably less efficient than a single SameDiffLayer combining it all at once), but that doesn't help me at all with issue #2 (I need those attention weights for further linguistic analysis).
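
And to illustrate what I mean by "fetching from outside" in issue #2: the only mechanism I'm aware of is to drive a plain SameDiff graph directly (i.e. not through the DL4J layer wrapper) and request the named intermediate variable as an extra output. Below is a minimal sketch of that mechanism only; the two matrix multiplies are just stand-ins for the real attention block, and all names are made up.

```java
import java.util.Collections;
import java.util.Map;

import org.nd4j.autodiff.samediff.SDVariable;
import org.nd4j.autodiff.samediff.SameDiff;
import org.nd4j.linalg.api.buffer.DataType;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

public class FetchIntermediateVariable {
    public static void main(String[] args) {
        SameDiff sd = SameDiff.create();
        SDVariable in = sd.placeHolder("in", DataType.FLOAT, -1, 4);
        SDVariable w1 = sd.var("w1", Nd4j.rand(4, 4));
        SDVariable w2 = sd.var("w2", Nd4j.rand(4, 4));

        // "attWeights" stands in for any intermediate ARRAY-type variable,
        // e.g. the attention-weights output of the attention op
        SDVariable attWeights = in.mmul("attWeights", w1);
        SDVariable out = attWeights.mmul("out", w2);

        Map<String, INDArray> placeholders =
                Collections.singletonMap("in", Nd4j.rand(2, 4));
        // Asking for "attWeights" by name makes SameDiff return its array alongside the output
        Map<String, INDArray> results = sd.output(placeholders, "out", "attWeights");
        System.out.println(results.get("attWeights"));
    }
}
```

If something equivalent is possible from inside a ComputationGraph (the SelfAttentionLayer route), that would solve issue #2 for me.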