Transformer Encoder with Self-Attention layers

One more question @treo. I've already put it in the SameDiff section, but maybe you know the answer. After implementing the encoder layer model mentioned in this chat with SameDiff, I have all the basic transformations I need, but I ran into a problem applying feature masks to transformations other than MultiHeadDotProductAttention. Because each transformer encoder layer has one dense layer after each SelfAttentionLayer, I need to apply feature masks to each such layer (if not to the input, then to the activations). I also have a first hidden layer that takes token and positional embeddings as input; it needs the feature masks applied as well.

I looked into MultiLayerNetwork and saw that feature masks are essentially applied by broadcast-multiplying the pre-outputs and the activations, plus during backpropagation. The first two I can easily implement, but BP is my problem: I haven't found any way to force the backprop in SameDiff to use the feature masks. Do you have any suggestions for how I could work around this issue?
I thought of using ArgumentInterceptor, but that's internal machinery and not a flexible option.
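For context, this is roughly the forward-pass masking I mean, written with plain ND4J. It's only my sketch of the idea, not actual MultiLayerNetwork code, and the shapes are just an example:

```java
import org.nd4j.linalg.api.buffer.DataType;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Broadcast;
import org.nd4j.linalg.factory.Nd4j;

public class MaskSketch {
    public static void main(String[] args) {
        // Activations of shape [batch, seqLen, features] and a [batch, seqLen] mask
        // with 1.0 for real tokens and 0.0 for padding.
        INDArray activations = Nd4j.rand(DataType.FLOAT, 2, 5, 8);
        INDArray mask = Nd4j.ones(DataType.FLOAT, 2, 5);
        mask.putScalar(0, 4, 0.0);   // pretend the last step of example 0 is padding

        // Broadcast the mask along dimensions 0 (batch) and 1 (time) and multiply,
        // zeroing out the activations of the padded time steps.
        INDArray masked = activations.like();
        Broadcast.mul(activations, mask, masked, 0, 1);
        System.out.println(masked);
    }
}
```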

I’ve answered the question on when and how to apply a mask in SameDiff over at your other thread: Feature Mask application in custom SameDiff model - #2 by treo

It isn’t necessary for you to do that. With SameDiff, the mask is automatically taken into account during backprop once you apply it in the forward pass. The entire point of SameDiff is that it derives the whole backward pass for you, so you don’t have to define it manually.
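To make that concrete, here is a minimal sketch of a masked per-time-step dense layer in SameDiff. The shapes (NWC, i.e. [batch, seqLen, features]), the variable names, and the tanh activation are just assumptions for illustration, not your exact model:

```java
import org.nd4j.autodiff.samediff.SDVariable;
import org.nd4j.autodiff.samediff.SameDiff;
import org.nd4j.linalg.api.buffer.DataType;
import org.nd4j.weightinit.impl.XavierInitScheme;

public class MaskedDenseSketch {
    public static void main(String[] args) {
        SameDiff sd = SameDiff.create();

        int seqLen = 16, nIn = 64, nOut = 64;

        // [batch, seqLen, nIn] output of the attention layer and a [batch, seqLen]
        // mask with 1 for real tokens and 0 for padding.
        SDVariable attnOut = sd.placeHolder("attnOut", DataType.FLOAT, -1, seqLen, nIn);
        SDVariable mask    = sd.placeHolder("mask",    DataType.FLOAT, -1, seqLen);

        SDVariable w = sd.var("w", new XavierInitScheme('c', nIn, nOut), DataType.FLOAT, nIn, nOut);
        SDVariable b = sd.var("b", DataType.FLOAT, nOut);

        // Per-time-step dense layer: flatten the time dimension, multiply, restore it.
        SDVariable flat  = attnOut.reshape(-1, nIn);            // [batch*seqLen, nIn]
        SDVariable dense = sd.math().tanh(flat.mmul(w).add(b)); // [batch*seqLen, nOut]
        SDVariable out3d = dense.reshape(-1, seqLen, nOut);     // [batch, seqLen, nOut]

        // Apply the feature mask by broadcast multiplication over the feature dim.
        // Because this is just another op in the graph, SameDiff differentiates
        // through it automatically - no manual gradient masking is needed.
        SDVariable masked = out3d.mul(sd.expandDims(mask, 2));  // [batch, seqLen, nOut]

        System.out.println(sd.summary());
    }
}
```

Since the gradients flow back through that same multiplication, the contributions from the padded time steps are zeroed out during training without you writing any extra backprop code.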
