Hi. I'm trying to implement a Transformer encoder (similar to the one in BERT): self-attention with normalization, followed by a feed-forward network and an output layer with Softmax. I'm using SelfAttentionLayer.java for this, and I tried to replicate the configuration from https://github.com/eclipse/deeplearning4j/blob/master/deeplearning4j/deeplearning4j-core/src/test/java/org/deeplearning4j/nn/dtypes/DTypeTests.java.

My use case is to break the corpus into a series of token sequences (as far as I understand, for SelfAttentionLayer one sequence corresponds to one entry in the minibatch), 8 tokens each, with a word embedding size of 100. That means an input of shape [8, 100] for each sequence, and I expect the output to be [8, 100] as well. But SelfAttentionLayer expects an input of shape [batchSize, features, timesteps], and I still can't understand what "timesteps" means in my case. In the code itself it's sometimes referred to as "seqLength", which is quite confusing to me.

Since I'm implementing an encoder, there should be no recurrent behavior as there would be in a decoder (e.g. in a translation use case): SelfAttentionLayer should process the whole sequence of 8 tokens at the same time and produce an output (with the attention weights applied) that projects the input. I tried to find examples that would clarify the meaning of that "timesteps" dimension, but I failed.

Taking into account the fact that @treo implemented this functionality, any help from him, or from anyone else who knows this domain, is highly appreciated.
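To make the question concrete, here is a minimal sketch of what I've pieced together from DTypeTests.java. The names `embeddingSize` and `sequenceLength` and the values `nHeads(4)` and `numClasses` are placeholders of mine, and the input layout at the very end is exactly the part I'm unsure about:

```java
import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.deeplearning4j.nn.conf.inputs.InputType;
import org.deeplearning4j.nn.conf.layers.GlobalPoolingLayer;
import org.deeplearning4j.nn.conf.layers.OutputLayer;
import org.deeplearning4j.nn.conf.layers.PoolingType;
import org.deeplearning4j.nn.conf.layers.SelfAttentionLayer;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.nd4j.linalg.activations.Activation;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;
import org.nd4j.linalg.lossfunctions.LossFunctions;

public class SelfAttentionShapeTest {
    public static void main(String[] args) {
        int embeddingSize = 100;  // word embedding dimension ("features")
        int sequenceLength = 8;   // tokens per sequence (the "timesteps" dimension?)
        int numClasses = 10;      // placeholder output size

        MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
                .list()
                .layer(new SelfAttentionLayer.Builder()
                        .nIn(embeddingSize)
                        .nOut(embeddingSize)
                        .nHeads(4)            // placeholder; must divide nOut evenly
                        .projectInput(true)
                        .build())
                // collapses the per-token outputs so a standard OutputLayer can follow
                .layer(new GlobalPoolingLayer(PoolingType.MAX))
                .layer(new OutputLayer.Builder(LossFunctions.LossFunction.MCXENT)
                        .nIn(embeddingSize)
                        .nOut(numClasses)
                        .activation(Activation.SOFTMAX)
                        .build())
                .setInputType(InputType.recurrent(embeddingSize, sequenceLength))
                .build();

        MultiLayerNetwork net = new MultiLayerNetwork(conf);
        net.init();

        // One sequence of 8 tokens with 100-dim embeddings, laid out as
        // [batchSize, features, timesteps] = [1, 100, 8] -- is this right?
        INDArray input = Nd4j.rand(new int[]{1, embeddingSize, sequenceLength});
        INDArray out = net.output(input);
        System.out.println(java.util.Arrays.toString(out.shape()));
    }
}
```

With this layout I would read `[1, 100, 8]` as one sequence, 100 features per token, and 8 "timesteps", which seems to conflict with my [8, 100] intuition above; that mismatch is precisely what I'd like to understand.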
Thanks in advance