Transformer's Encoder with Self-Attention layers

One more question @treo. I’ve already posted it in the SameDiff section, but maybe you know the answer. After implementing the encoder-layer model mentioned in this chat with SameDiff, I got it to perform all the basic transformations I need, but I ran into a problem applying feature masks to transformations other than MultiHeadDotProductAttention. Because each transformer encoder layer has one dense layer after each SelfAttentionLayer, I need to apply feature masks to each such layer (if not to the input, then to the activations). I also have a first hidden layer that takes the token and positional embeddings as input; it needs feature masks applied as well. Looking into MultiLayerNetwork, I saw that feature masks are effectively applied by broadcast-multiplying them with the pre-outputs and activations, plus handling them during backpropagation. The first two I can easily implement, but backprop is my problem: I haven’t found any way to force backprop in SameDiff to use the feature masks. Do you have any suggestions for working around this issue?
I thought of using ArgumentInterceptor, but that’s internal machinery, not a flexible option.

I’ve answered the question on when and how to apply a mask in SameDiff over at your other thread: Feature Mask application in custom SameDiff model

It isn’t necessary for you to do that. When you apply the mask in the forward pass, SameDiff automatically accounts for it during backprop as well. The entire point of SameDiff is that it derives the whole backward pass for you, so you don’t have to define it manually.
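To make that concrete, here is a minimal sketch of applying a feature mask in a SameDiff forward pass via a broadcast multiply. The class name, variable names, and shapes are illustrative (a mini-batch of 4 time steps with 3 features); the point is only that the mask is an ordinary graph operation, so its effect flows through gradient computation without any extra work:

```java
import org.nd4j.autodiff.samediff.SDVariable;
import org.nd4j.autodiff.samediff.SameDiff;
import org.nd4j.linalg.api.buffer.DataType;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

import java.util.HashMap;
import java.util.Map;

public class MaskSketch {
    public static void main(String[] args) {
        SameDiff sd = SameDiff.create();

        // Activations coming out of a dense layer: [minibatch, timeSteps, features]
        SDVariable act  = sd.placeHolder("act",  DataType.FLOAT, -1, 4, 3);
        // Feature mask with 1 for real time steps, 0 for padding: [minibatch, timeSteps, 1]
        SDVariable mask = sd.placeHolder("mask", DataType.FLOAT, -1, 4, 1);

        // Broadcast multiply zeroes out the masked (padding) time steps.
        // Because this is part of the graph, SameDiff propagates it in backprop too.
        SDVariable masked = act.mul("masked", mask);

        INDArray a = Nd4j.ones(DataType.FLOAT, 1, 4, 3);
        INDArray m = Nd4j.create(new float[]{1, 1, 0, 0}, new long[]{1, 4, 1});

        Map<String, INDArray> placeholders = new HashMap<>();
        placeholders.put("act", a);
        placeholders.put("mask", m);

        INDArray out = sd.output(placeholders, "masked").get("masked");
        System.out.println(out);
    }
}
```

The same pattern applies after each dense layer and after the embedding layer: multiply the activations by the (broadcastable) mask in the forward definition, and SameDiff handles the rest.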