First, it helps to understand what masking does in general: it simply zeros out the "useless" outputs.
In your scenario, you are masking sequences so that you can put different sequence lengths into the same minibatch. The consequence is that you only need to apply the mask at the points where it actually makes a difference.
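As a minimal sketch of what such a mask looks like (assuming PyTorch here, since the question does not name a framework; the variable names are illustrative), you can build it from the per-sequence lengths:

```python
import torch

lengths = torch.tensor([5, 3, 2])          # actual lengths of 3 sequences in one batch
max_len = int(lengths.max())               # padded length of the minibatch, here 5

# mask[b, t] is 1.0 for real time steps and 0.0 for padding
mask = (torch.arange(max_len)[None, :] < lengths[:, None]).float()
# tensor([[1., 1., 1., 1., 1.],
#         [1., 1., 1., 0., 0.],
#         [1., 1., 0., 0., 0.]])
```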
That is also why the attention ops take a mask: they need to know which positions to ignore whenever a calculation spans the sequence, e.g. the softmax over the attention scores.
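Hedged sketch of that idea (again assuming PyTorch; this is the standard "set padded keys to -inf before the softmax" pattern, not your exact attention op): padded key positions end up with zero attention weight, so they cannot leak into the other steps.

```python
import torch
import torch.nn.functional as F

B, T, D = 2, 4, 8
q, k, v = (torch.randn(B, T, D) for _ in range(3))
key_mask = torch.tensor([[1, 1, 1, 0],     # 1 = real step, 0 = padding
                         [1, 1, 0, 0]]).bool()

scores = q @ k.transpose(-2, -1) / D ** 0.5                 # (B, T, T) attention scores
scores = scores.masked_fill(~key_mask[:, None, :], float("-inf"))
weights = F.softmax(scores, dim=-1)                          # padded keys get weight 0
out = weights @ v
```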
If you are only applying direct, per-step transformations (e.g. a linear layer or an activation applied to each time step individually), you do not need to mask at that point, because the garbage values at padded steps do not influence the other steps.
In that case, you only need to apply masking at the loss step. Exactly how depends on the loss function you are using, but it usually comes down to a simple element-wise multiplication of the per-step losses with the mask.
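A hedged sketch of that last step (PyTorch assumed, cross-entropy chosen only as an example loss; your loss function may differ): compute the loss per step without reduction, multiply by the mask, and average over the real steps only.

```python
import torch
import torch.nn.functional as F

batch, time, vocab = 3, 5, 10
logits = torch.randn(batch, time, vocab)                    # model outputs, incl. padded steps
targets = torch.randint(0, vocab, (batch, time))            # targets, garbage at padded steps

lengths = torch.tensor([5, 3, 2])
mask = (torch.arange(time)[None, :] < lengths[:, None]).float()   # (batch, time)

# per-step losses, then zero out the padded steps and average over real steps only
per_step = F.cross_entropy(
    logits.reshape(-1, vocab), targets.reshape(-1), reduction="none"
).reshape(batch, time)
loss = (per_step * mask).sum() / mask.sum()
```

Dividing by `mask.sum()` rather than the total number of elements keeps the loss scale independent of how much padding a particular minibatch happens to contain.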