Transformer's Encoder with Self-Attention layers

Hi. I’m trying to implement a Transformer encoder (similar to the one in BERT) using self-attention with normalization, followed by an FFN and an output layer with softmax. I’m using SelfAttentionLayer.java for that purpose; I took a look at https://github.com/eclipse/deeplearning4j/blob/master/deeplearning4j/deeplearning4j-core/src/test/java/org/deeplearning4j/nn/dtypes/DTypeTests.java and tried to replicate the config.

My use case is to break the corpus into a series of token sequences (as far as I understand, for SelfAttentionLayer one sequence is one batch), 8 tokens each, with a word embedding size of 100, which means an input of shape [8, 100] for each sequence. I expect the output to be [8, 100] as well. But SelfAttentionLayer expects an input of shape [batchSize, features, timesteps], and I still can’t understand what “timesteps” means in my case. In the code itself it’s sometimes referred to as “seqLength”, which is quite confusing to me.

As I’m implementing an encoder, there should be no recurrent behavior as in a decoder (e.g. in a translation use case), and SelfAttentionLayer should process the whole sequence of 8 tokens at once, producing an output (with attention weights applied) that is a projection of the input. I was trying to find examples that could clarify the meaning of that “timesteps” dimension, but I failed. Since @treo implemented this functionality, any help from him or anyone else who knows this domain is highly appreciated.

Thanks in advance

That is where your misunderstanding lies. A batch is always one or more examples. A sequence is always a series of one or more timesteps.

There is nothing inherently recurrent about those terms.

The shape you want for a single example is therefore [1, 100, 8]; if you want to pass n examples with a sequence length of at most m steps at once, it would be [n, 100, m].

DL4J has a concept of masking for timesteps; that way you can have sequences of different lengths in the same mini-batch.
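In practice that just means passing a per-timestep mask along with the features. A rough sketch (assuming you already have features, labels and a MultiLayerNetwork called net; only the shapes matter here):

int batchSize = 1, maxTimesteps = 8, actualLength = 5;
// featuresMask: [batchSize, maxTimesteps], 1.0 for real steps, 0.0 for padding
INDArray featuresMask = Nd4j.zeros(DataType.FLOAT, batchSize, maxTimesteps);
featuresMask.get(NDArrayIndex.point(0), NDArrayIndex.interval(0, actualLength)).assign(1.0);

// features: [batchSize, nIn, maxTimesteps], labels shaped as the output layer requires
net.fit(features, labels, featuresMask, null);   // no label mask in this sketch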

Also, because self-attention has a memory requirement that is quadratic in the sequence length, you should make sure that the sequence lengths within a single batch are closely aligned, or you’ll waste resources.

Thank you for the prompt response, @treo.
It’s actually quite difficult for me to get used to that dimension layout, because the examples of the self-attention mechanism I found online only use the dimensions [numberOfWordsInSequence, wordEmbeddingSize]. It seems that in my situation, since I fetch the embeddings from a lookup table and get a matrix of shape [numberOfWordsInSequence, wordEmbeddingSize], I have to use INDArray’s permute() method to create the shape that SelfAttentionLayer expects, am I right?
One more thing: I’ve tried running fit() on the model that uses a SelfAttentionLayer and noticed that it doesn’t use concurrency (which makes it quite slow). Is it possible to configure the model to run in parallel, or have I missed a configuration option?

I guess that the examples were not for DL4J.

As you can see in the javadoc for SelfAttentionLayer: https://github.com/eclipse/deeplearning4j/blob/master/deeplearning4j/deeplearning4j-nn/src/main/java/org/deeplearning4j/nn/conf/layers/SelfAttentionLayer.java#L35-L51

and in https://github.com/eclipse/deeplearning4j/blob/a1fcc5f19f0f637e83252b00982b3f12b401f679/nd4j/nd4j-backends/nd4j-api-parent/nd4j-api/src/main/java/org/nd4j/autodiff/samediff/ops/SDNN.java#L684-L709

Both explicitly tell you what kind of input shape is expected.

Actually, you want an expandDims to get the leading 1, and then a permute to get the dimensions into the order you want.
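With your [8, 100] lookup result that is roughly:

INDArray tokens = Nd4j.rand(DataType.FLOAT, 8, 100);   // stand-in for your [numberOfWordsInSequence, wordEmbeddingSize] lookup result
INDArray withBatch = Nd4j.expandDims(tokens, 0);       // [1, 8, 100]
INDArray input = withBatch.permute(0, 2, 1);           // [batchSize, features, timesteps] = [1, 100, 8]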

How concurrently it runs depends on the amount of data you put into it. But due to the way it necessarily has to work, it is somewhat slow; there is little you can do about it other than getting more powerful hardware.

Thanks again for the quick response.

Actually, I’m training the model on my CPU, and only 1 out of 8 cores is being used for the calculations. When I used some of the reinforcement-learning models or SkipGram in Word2vec, the CPU load was more than 50%. That’s why I was wondering whether I used a wrong config for the SelfAttentionLayer model. If I understand correctly, the larger the batch size and sequence length, the more CPU cores the calculations will use? I took a look into https://github.com/eclipse/deeplearning4j/blob/master/libnd4j/include/helpers/impl/AttentionHelper.cpp and https://github.com/eclipse/deeplearning4j/blob/master/libnd4j/include/ops/declarable/generic/nn/multi_head_dot_product_attention.cpp but couldn’t figure out whether that’s the case.

Attention on its own is mostly a series of matrix multiplications with a softmax in-between.

Because of this, all of the concurrency comes from the BLAS library deciding whether it is worthwhile to use more than one core. If you don’t feed it enough data, it won’t.
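For a single head it boils down to something like this (a rough ND4J sketch, ignoring the Wq/Wk/Wv/Wo projections; seqLen and dk are just illustration values):

int seqLen = 8, dk = 20;                                        // e.g. 100 features / 5 heads
INDArray q = Nd4j.rand(DataType.FLOAT, seqLen, dk);
INDArray k = Nd4j.rand(DataType.FLOAT, seqLen, dk);
INDArray v = Nd4j.rand(DataType.FLOAT, seqLen, dk);

INDArray scores = q.mmul(k.transpose()).divi(Math.sqrt(dk));    // [seqLen, seqLen] -- this is where the quadratic cost comes from
INDArray attention = Transforms.softmax(scores);                // row-wise softmax
INDArray out = attention.mmul(v);                               // [seqLen, dk]

Each of those mmul calls goes to BLAS, and BLAS only parallelizes when the matrices are large enough.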

Perfect. Thanks a lot for the info!
One more question: while looking at the gradients after fitting the model, I noticed that the matrices Wo and Wv have gradients calculated, but Wk and Wq don’t. Is that correct behavior or a wrong configuration? Or should I use LearnedSelfAttentionLayer to get back-propagation to at least Wq?

Gradients should be propagated to all weights. However, depending on how you are looking at it, it might look like the gradient is exactly zero in cases where it is just very small.

Ok, thanks. I’ll try changing the learning rate to check whether the gradients for those two matrices really are non-zero.

Tested it with a higher learning rate. Gradients are really there, thanks a lot for the help! Now I’ll try to adjust the labels to see if the learning itself works with my config…

Sorry for disturbing you again, @treo, but I tried my config out and I get really weird results. After adapting my input data and labels (which are one-hot arrays based on the vocab size and are used as label masks at the same time) and feeding the network (a single, simple Transformer encoder) with a sequence of words, I get a softmax output that is the same for every word. Most probably I’m doing something wrong, but so far I can’t identify the cause. My network config:

new NeuralNetConfiguration.Builder()
                .dataType(DataType.FLOAT)
                .updater(new Adam(1e-3))
                .activation(Activation.TANH)
                .optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)
                .weightInit(WeightInit.XAVIER)
                .list()
                .layer(new SelfAttentionLayer.Builder().nOut(embeddingSize).nHeads(5).nIn(embeddingSize)
                        .projectInput(true).build())
                .layer(new DenseLayer.Builder().nIn(embeddingSize).nOut(embeddingSize).activation(Activation.RELU)
                        .build())
                .layer(new OutputLayer.Builder().nOut(vocab.size()).activation(Activation.SOFTMAX)
                        .lossFunction(LossFunctions.LossFunction.MSE).build())
                .setInputType(InputType.recurrent(embeddingSize))
                .build();

As the output I expect the shape [batchSize * sequenceLength, vocab.size()]. So if I input just one sentence (sequence), I expect to get a softmax distribution over the vocabulary for each word. But when I look at that distribution for each word in the sequence, it’s always the same (e.g. always the second word in the vocab), which is unexpected because the embeddings for each word are unique. It would be great to know what I’m missing here.

I’m wondering how your config even works.

The SelfAttentionLayer takes input of the shape [batchSize, nIn, timesteps] and produces an output in the shape of [batchSize, nOut, timesteps].

If I remember correctly, feeding a time-series-shaped input into a DenseLayer then uses just its last timestep.

So your total network output has the shape [batchSize, outputSize].

And then you have a mean square error loss function, which doesn’t make a lot of sense together with a softmax activation.

I’m confused how exactly you are even getting to your result.

If you had used an RNN output layer without the DenseLayer in the middle, that would have made some more sense, and you would have gotten a result with the proper shape of [batchSize, outputSize, timesteps].
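Just to illustrate, something along these lines (a sketch only, reusing your variable names) would give you per-timestep outputs:

new NeuralNetConfiguration.Builder()
        .updater(new Adam(1e-3))
        .weightInit(WeightInit.XAVIER)
        .list()
        .layer(new SelfAttentionLayer.Builder().nIn(embeddingSize).nOut(embeddingSize)
                .nHeads(5).projectInput(true).build())
        .layer(new RnnOutputLayer.Builder(LossFunctions.LossFunction.MCXENT)
                .activation(Activation.SOFTMAX)
                .nIn(embeddingSize).nOut(vocab.size()).build())
        .setInputType(InputType.recurrent(embeddingSize))
        .build();

The labels then need the shape [batchSize, vocab.size(), timesteps].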

That was my guess as well. Because there’s no explicit Transformer example in DL4J, I tried to put together something resembling one based on the existing unit tests, but it seems I failed.

Actually, there’s an RnnToFeedForwardPreProcessor which reshapes the output without any loss. But today I realized it makes no sense to use that workflow, because the prediction needs to be based on the whole sequence, not on separate words. So I replaced the last dense layer with a GlobalPoolingLayer using PoolingType.AVG. I still need the dense layers, because they are part of a standard Transformer encoder (I’m still struggling to add the residual connections and layer normalization).

Thanks a lot for pointing that out. I did use it at first, but because I misinterpreted the point of having a pooling layer, I decided to use label masks, and LossFunction.MCXENT doesn’t allow them to be used.

As far as I understand, the pooling layer generalizes over the timestep outputs, thus in my case capturing a representation of the whole sentence (sequence). Adding positional encoding on top should take care of bidirectional context relationships.

After applying the pooling layer and MCXENT loss function I finally started getting at least partially acceptable predictions. Thanks a lot @treo !!!
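For reference, the config that finally produces sensible output looks roughly like this (sketch, details trimmed):

new NeuralNetConfiguration.Builder()
        .updater(new Adam(1e-3))
        .weightInit(WeightInit.XAVIER)
        .list()
        .layer(new SelfAttentionLayer.Builder().nIn(embeddingSize).nOut(embeddingSize)
                .nHeads(5).projectInput(true).build())
        .layer(new GlobalPoolingLayer.Builder(PoolingType.AVG).build())   // pools over timesteps -> [batchSize, embeddingSize]
        .layer(new OutputLayer.Builder(LossFunctions.LossFunction.MCXENT)
                .activation(Activation.SOFTMAX)
                .nIn(embeddingSize).nOut(vocab.size()).build())
        .setInputType(InputType.recurrent(embeddingSize))
        .build();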

Now I plan to shape this model into an architecture similar to BERT, use BertWordPieceTokenizer to get a vocab of acceptable size, and start training the model on real-world data. Hopefully I won’t hit any blockers while building the final model.

You might want to create your own SameDiffLayer and build the transformer units that way. I guess it will be easier.

See https://github.com/eclipse/deeplearning4j/blob/master/deeplearning4j/deeplearning4j-nn/src/main/java/org/deeplearning4j/nn/conf/layers/SelfAttentionLayer.java for an example of how it can be used.

Ok. Thanks for the advice!

I decided to follow your advice, @treo, and implemented my own SameDiffLayer which basically represents a single Transformer encoder (attention layer + dense layer + normalization). Unfortunately I got stuck on 2 issues:

  1. In order to feed the attention layer results forward into the dense layer (in my case I wanted to use sameDiff.nn.reluLayer()), I need to reshape the RNN format into the FFN format. Using an RnnToFeedForwardPreProcessor directly is not an option, because it works with INDArrays, not SDVariables. Direct reshaping didn’t work either, because in the public SDVariable defineLayer() method, where my new logic resides, it’s impossible to get access to the batchSize of the SDVariable layerInput (which makes sense, because it’s not there yet), and I couldn’t find any workaround (e.g. some lazy retrieval of batchSize from layerInput).
  2. At some point I’m going to need the second, optional output of MultiHeadDotProductAttention: the attention weights. I tried to retrieve them as an INDArray inside my SameDiffLayer implementation, but because it’s an ARRAY-type variable I can’t fetch it directly from outside. I thought of using SameDiffLayer.getLayerParams(), but I’m not sure it will give me exactly the weights I need (I want the kind of information shown on page 13 of the original paper: https://arxiv.org/pdf/1706.03762.pdf).

I can work around the first issue by using the existing SelfAttentionLayer and adding the DenseLayer and then normalization after it in the model’s layer ListBuilder (which will most probably be less efficient than having my own SameDiffLayer combining it all at once), but that won’t help me resolve issue #2 anyway (I need those attention weights for further linguistic analysis).

A dense layer is just y=f(Wx+b), so you can also just apply that calculation instead of using a predefined layer that expects a specific input shape.

And while you can’t reshape because you don’t have access to the actual shape, you can still permute, expandDims and squeeze.
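In plain SameDiff that dense calculation is just a couple of ops. A rough standalone sketch (names and sizes are mine; inside defineLayer() you would use the variables from paramTable instead of creating new ones):

SameDiff sd = SameDiff.create();
SDVariable x = sd.placeHolder("x", DataType.FLOAT, -1, 100);                         // [rows, nIn]
SDVariable w = sd.var("w", Nd4j.rand(DataType.FLOAT, 100, 100).subi(0.5).muli(0.1));
SDVariable b = sd.var("b", Nd4j.zeros(DataType.FLOAT, 1, 100));

SDVariable y = sd.nn().relu("y", x.mmul(w).add(b), 0.0);                             // y = relu(xW + b), i.e. a dense layer

INDArray out = sd.output(Collections.singletonMap("x", Nd4j.rand(DataType.FLOAT, 8, 100)), "y").get("y");   // [8, 100]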

In that case your best bet is to define your entire network in plain SameDiff. That gives you more options for recovering any optional outputs.

The DL4J Layer system is a bit of a straitjacket, as it forces you into a certain way of working, and optional outputs aren’t a part of that.

I guess you mean an element-wise multiplication in this case? I also thought of that at some point, but I wasn’t sure it was the right solution, so I decided it would be more efficient to reuse an already existing layer.

I actually tried to do that, e.g. using a unique set of params for each encoder layer (all self-attention layer params + dense layer weights) and packing it all into one custom SameDiffLayer. I thought that would automatically eliminate the vanishing gradient problem (it’s basically the same layer) and increase performance, since all the feed-forward and back-propagation operations would then be executed inside the same graph (SameDiffLayer). But the resulting performance of that layer was almost 10 times slower than simply using separate layers one after another inside the model. It seems I lack some important knowledge of the internals of the SameDiff workflow and memory management to resolve that.

It is still a matrix multiplication, but you are using the same matrix for each timestep, and there is no recurrence.

What I suggested and what you’re describing here are two different things.

What I’m talking about looks like this in practice: https://github.com/eclipse/deeplearning4j-examples/blob/master/samediff-examples/src/main/java/org/nd4j/examples/samediff/quickstart/modeling/MNISTCNN.java

The gradient doesn’t necessarily care about layers. It “vanishes” when it needs to go through too many steps (i.e. it is multiplied by a value < 1.0 at each step, and if you do that often enough it is effectively gone; that’s why RNNs don’t work that well beyond a certain length).

This too doesn’t quite work that way.

It depends on what you are actually doing; in principle you should get exactly the same speed, that is, if you are using the most recent version. In previous versions, not all of SameDiff was GPU accelerated.

Thanks a lot for these tips, @treo! I’ll take a look at the example you provided to figure out how to implement what I need; hopefully I’ll also pick up some more knowledge about the SameDiffLayer workflow from it. Regarding the performance: I’m currently using only the CPU, and both options were tested directly on it, so I guess the problem is a wrong SameDiffLayer implementation on my side.