I may be mistaken, but it seems that PyTorch Transformers are autoregressive, which is what masking is for. However, I've seen some implementations where people use just the Encoder and output that directly to a Linear
layer.
In my case, I'm trying to convert a spectrogram (rows are frequencies and columns are timesteps) to another spectrogram of the same dimensions. I'm having an impossible time trying to figure out how to do this.
For my model, I have:
class TransformerReconstruct(nn.Module):
def __init__(self, feature_size=250, num_layers=1, dropout=0.1, nhead=10, output_dim=1):
super(TransformerReconstruct, self).__init__()
self.model_type = 'Transformer'
self.src_mask = None
self.pos_encoder = PositionalEncoding(feature_size)
self.encoder_layer = nn.TransformerEncoderLayer(d_model=feature_size, nhead=nhead, dropout=dropout)
self.transformer_encoder = nn.TransformerEncoder(self.encoder_layer, num_layers=num_layers)
self.decoder = nn.Linear(feature_size, output_dim)
self.init_weights()
def init_weights(self):
initrange = 0.1
self.decoder.bias.data.zero_()
self.decoder.weight.data.uniform_(-initrange, initrange)
def forward(self, src):
if self.src_mask is None or self.src_mask.size(0) != len(src):
device = src.device
mask = self._generate_square_subsequent_mask(len(src)).to(device)
self.src_mask = mask
src = self.pos_encoder(src)
output = self.transformer_encoder(src, self.src_mask)
output = self.decoder(output)
return output
def _generate_square_subsequent_mask(self, sz):
mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)
mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
return mask
And when training, I have:
model = TransformerReconstruct(feature_size=128, nhead=8, output_dim=128, num_layers=6).to(device)
This returns the right shape, but doesn't seem to learn.
My basic training loop looks like:
for i in range(0, len(data_source) - 1, input_window):
data, target = get_batch(data_source, i, 1)
output = recreate_model(data)
and I'm using an MSELoss
and I'm trying to learn a very simple identity. Where the input and output are the same, however this is not learning. What could I be doing wrong? Thanks in advance.
Most of the models in Huggingface Transformers are some version of BERT and thus not autoregressive, the only exceptions are decoder-only models (GPT and similar) and sequence-to-sequence model.
There are two conceptually different types of masks: one is the input mask that is specific to the input batch and the purpose is allowing using sequences of different lengths in a single batch. When the sequences get padded to the same length, the self-attention should attend to the padding positions. This is what you are supposed to use when you call self.transformer_encoder
in the forward method.
In addition, the autoregressive Transformer decoder uses another type of mask. It is the triangular mask that prevents the self-attention to attend to tokens that are right of the current position (at inference time, words right of the current position are unknown before they are actually generated). This is what you have in the _generate_square_subsequent_mask
method and this is what makes the model autoregressive. It is constant and does not depend on the input batch.
To summarize: to have a bidirectional Transformer, just get rid of the triangular mask. If your input sequences are of different lengths, you should use batch-specific masking, if not, just pass a matrix with ones.
If you want the model to stop behaving in an autoregressive manner, you need to 'unhide' tokens to the right of the current token i.e. modify/remove _generate_square_subsequent_mask
.
How you modify this depends on the task. Are you trying to recover 'corrupted' input sequences? Then mask a random subset of tokens and treat it as an autoencoder.
If you just wish to approximate the identity function remove the mask completely.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With