Hi. I’m trying to implement my own SameDiff model without invoking SameDiffLayers directly (e.g. I want to invoke linear transformations without bias). I’ll be using RNN-like inputs in combination with a feed-forward behavior as in DenseLayer. The problem is that the sequence length is not fixed, so I need feature masks. I’ll be using the MultiHeadDotProductAttention operation in between, and it already takes the feature masks into account. What remains are e.g. the ReLU layer transformations before MultiHeadDotProductAttention and so on, which don’t handle feature masks by default. While analyzing BaseLayer, BaseOutputLayer and others used in MultiLayerNetwork, I’ve noticed that they process feature masks at different stages (feed-forward, backprop, score calculation etc.), which is quite complex and makes it a real challenge to implement in a custom SameDiff model. Is there a similar workflow or any other solution which could be used in custom SameDiff models?
You first need to understand how feature masking works in general: it effectively zeros out “useless” outputs.
In your scenario you are masking sequences so you can work with different sequence lengths in the same minibatch. This means that you only ever need to apply masking when it actually makes a difference.
That is also the reason why the attention ops care about a mask: they need to know which results to ignore when sequence-spanning calculations are necessary.
If you are applying just direct transformations on each step individually, you don’t actually need to apply masking at that point, because the useless results do not influence the other steps.
In that case, you only need to apply masking at the loss step. How exactly that should be done depends on the loss function you are using, but it is usually just a simple mul (an element-wise multiplication) that is used to apply the mask.
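For illustration, a minimal sketch of that pattern, using hypothetical names: perStepLoss and mask are both assumed to have shape [miniBatchSize, sequenceLength], with the mask being 1.0 for real steps and 0.0 for padding.

// Sketch only (hypothetical names): zero out the padded steps, then average over the real steps
SDVariable maskedLoss = perStepLoss.mul(mask);                    // element-wise mul applies the mask
SDVariable finalLoss = maskedLoss.sum(0, 1).div(mask.sum(0, 1));  // divide by the number of unmasked steps
sd.setLossVariables(finalLoss);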
Thanks for a really quick reply! I really appreciate that.
In my case the sequences are practically never fully packed (I use full sentences as input right now, and they have different lengths). So I need feature masks almost always - and the masks themselves are also always different.
Actually in my case the output of one layer is the input to the next one. The hidden layer output is the input to the SelfAttentionLayer, the latter provides the data for the ReLU layer, etc. Based on what you wrote, I suppose I only need to take care of zeroing out the outputs which are used as inputs to other layers, right?
I use the following code for the output layer:
SDVariable outputLayerResult = attentionOutputPooledByMean.mmul("outputLayerResult", outputLayerWeights);
SDVariable logitPredictions = sd.nn().softmax(MODEL_OUTPUT_SD_VARIABLE_NAME, outputLayerResult, 1);
SDVariable loss = sd.loss().softmaxCrossEntropy("loss", labels, logitPredictions, null);
sd.setLossVariables(loss);
Where exactly in this case should I use the element-wise multiplication? On the attentionOutputPooledByMean variable? My feature masks have the shape [miniBatchSize, sequenceLength], while the logits have the shape [miniBatchSize, vocabSize]. If I’m not wrong, I have to apply the feature masks earlier, when calculating attentionOutputPooledByMean.
You define your masks as placeholders and then you provide their actual values during training or evaluation.
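For example, a minimal sketch (hypothetical names and toy shapes, not your actual model): the mask is declared as a placeholder just like the features, and its concrete values are supplied for every minibatch when running the graph.

SameDiff sd = SameDiff.create();
// features: [miniBatch, nIn, seqLen], mask: [miniBatch, seqLen]
SDVariable input = sd.placeHolder("input", DataType.FLOAT, -1, 16, -1);
SDVariable mask = sd.placeHolder("mask", DataType.FLOAT, -1, -1);
// any graph that consumes the mask - here a masked sum over the time dimension
SDVariable pooled = sd.sum("pooled", input.mul(sd.expandDims(mask, 1)), 2);

Map<String, INDArray> placeholders = new HashMap<>();
placeholders.put("input", Nd4j.rand(new int[]{4, 16, 10}));
placeholders.put("mask", Nd4j.ones(4, 10));
Map<String, INDArray> out = sd.output(placeholders, "pooled");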
You’ve got to think of it in terms of calculations instead of layers. Technically the things you mask out will also be used in the calculations, but then you zero out the results.
When you apply a matrix multiplication to each step in a sequence (i.e. what a dense layer does), each result is independent of the other results. Zeroing out the unneeded steps doesn’t make a difference here.
When you calculate a softmax / pooling / etc across the entire sequence, the results of those unneeded calculations would influence the result of the calculation, and you therefore need to apply masking.
This means that you need to understand when and how the unneeded results will influence other calculations, and apply masking appropriately there.
That is also the reason that the attention ops take a masking parameter.
In this case you would use it on outputLayerResult. That way the results you don’t care about will not be taken into consideration by the softmax (or rather, they will be zero, thereby getting a zero probability).
Because of this element-wise zeroing, those calculations will also not influence the gradient.
That’s exactly what I do in my model in order to use those feature masks in SameDiff. That’s why I was asking about attentionOutputPooledByMean - because this variable is actually the average-pooling result of MultiHeadDotProductAttention.
I’m a little confused here. My feature masks have the shape [miniBatchSize, timeSeriesLength], while outputLayerResult has the shape [miniBatchSize, vocabSize], so a multiplication with the mask can’t be done there. The input to the output layer transformations is the average-pooling result, but that pooling already takes the feature masks into account.
I was actually also thinking of back-propagation when applying the feature masks to that average pooling of the MultiHeadDotProductAttention results. Because SameDiff has no RNN-based pooling op, I implemented it manually, using org.deeplearning4j.util.MaskedReductionUtil#maskedPoolingTimeSeries() (as used in GlobalPoolingLayer) as an example. However, that class has custom logic in public Pair<Gradient, INDArray> backpropGradient(), so I was worried that I might be missing this part in my SameDiff implementation.
Ah, I see. I thought you’d still have the full timeseries on your output. In that case you’ve got to apply the masking within your pooling.
As you are using average pooling, you’ll need to apply your mask before collecting the sum, and then when dividing by the series length, you’ll need to account for the masked steps.
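For reference, a minimal sketch of such a masked mean pooling in SameDiff (hypothetical names; it assumes the attention output has shape [miniBatch, nOut, seqLen] and the mask has shape [miniBatch, seqLen]):

SDVariable maskExpanded = sd.expandDims(mask, 1);                    // [miniBatch, 1, seqLen]
SDVariable maskedSum = sd.sum(attentionOutput.mul(maskExpanded), 2); // sum contributions of real steps only
SDVariable stepCount = sd.sum(mask, true, 1);                        // [miniBatch, 1] - unmasked steps per example
SDVariable attentionOutputPooledByMean = maskedSum.div(stepCount);   // divide by the real length, not seqLen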
I’m already doing that - just as in the example method in MaskedReductionUtil.
The only open question which still remains: should I do anything about back-propagation similar to what is done in GlobalPoolingLayer’s backpropGradient()? Or will SameDiff handle everything automatically?
In that case backpropagation should be handled automatically.
Got it. Thanks a lot for your help @treo !!!
Hi @treo. I decided to test the final variant of the model I mentioned in this thread. I trained it for some time and got a prediction accuracy of more than 90% on a batch of test data. After that I decided to test the model’s MLM prediction on a single sentence from the corpus. Since an average sentence is quite short compared to the sequence length of 128 that I use for training, I filled the rest of the sequence with padding tokens, which are automatically marked as 0s in the mask for MultiHeadDotProductAttention. But my test results showed a prediction accuracy close to 0%. At first I thought I had screwed up the training, but the training accuracy was fine, and training also masked the padding tokens with 0s.

After that I decided to take a look at the attention weights, trying to check whether self-attention really ignores the padding positions and concentrates solely on the sentence tokens. The results show that almost every attention head was hugely concentrated on those padding symbols; for some test sentences the attention heads were looking almost solely at padding positions. Given that the shape of those weights is [mini_batch][attention_heads][sequence_length][sequence_length], I read dimension 2 as the index of the token inside the sequence and dimension 3 (the last one) as the attention paid by this token to each of the other tokens. After that I did the analysis the other way around, permuting dimensions 2 and 3, and that gave results which actually make sense.

So I started wondering whether it’s OK that dimension 2 represents the attention paid to a specific token by each of the other tokens, and not vice versa. Because the last two weight dimensions have identical sizes, I suppose using the wrong one for nulling the masked positions could mess up the results. Although even with very long test sentences, where not many padding tokens are needed, I still get almost 0% accuracy. I’m not sure whether this is caused by incorrect mask handling in the attention module, but so far I couldn’t find any other possible cause.
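For what it’s worth, a small sketch of how the two readings could be compared, assuming the weights have already been extracted as an INDArray of shape [miniBatch, numHeads, seqLen, seqLen] (toy array and hypothetical names here):

// toy stand-in for the attention weights: [miniBatch=2, numHeads=4, seqLen=8, seqLen=8]
INDArray attnWeights = Nd4j.rand(new int[]{2, 4, 8, 8});

// reading 1: dimension 2 = attending token, dimension 3 = positions it attends to
INDArray reading1 = attnWeights.get(
        NDArrayIndex.point(0), NDArrayIndex.point(0), NDArrayIndex.point(0), NDArrayIndex.all());

// reading 2: swap the last two dimensions and take the same slice
INDArray reading2 = attnWeights.permute(0, 1, 3, 2).get(
        NDArrayIndex.point(0), NDArrayIndex.point(0), NDArrayIndex.point(0), NDArrayIndex.all());

System.out.println("reading 1: " + reading1);
System.out.println("reading 2: " + reading2);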
Actually, I got confirmation that the problem with the prediction accuracy has something to do with the padding (i.e. with the feature masks) during prediction. Because during training I try to utilize as much of the sequence length as possible, it’s quite obvious that during prediction the sequences will be much shorter. I assess the prediction accuracy during training by comparing the training labels with the actual model outputs; those are different from the ones used during the testing phase. So I tried to test with more or less the same data as during training.

I then noticed that if I take the same sequence as during training and use it in full, the prediction accuracy is really high. When, however, I truncate the sequence by as few as 3 tokens and pad those positions (which means 3 additional masked positions that were not masked during training), the prediction accuracy plummets to almost zero. On the other hand - which is incredible - if I truncate even 50-60% of the sequence and replace that part with any token except padding (like a dot or anything else), so that the total length is the same as during training and no additional positions are masked (only the ones which were masked during training), the prediction accuracy remains high.

Because every padding token is automatically masked for the self-attention module, they shouldn’t be taken into account at all. The fact that even 3 additional masked positions in the sequence can totally distort the prediction makes me think that there’s something about feature masking that has a huge impact on the result of MultiHeadDotProductAttention and that I’m missing it. Because I’m not that familiar with this design, I’d appreciate any help from your side, @treo. Any tip would be helpful.