
How can I do a seq2seq task with PyTorch Transformers if I am not trying to be autoregressive?

I may be mistaken, but it seems that PyTorch Transformers are autoregressive, which is what masking is for. However, I've seen some implementations where people use just the Encoder and output that directly to a Linear layer.

In my case, I'm trying to convert a spectrogram (rows are frequencies and columns are timesteps) to another spectrogram of the same dimensions. I'm having an impossible time trying to figure out how to do this.

For my model, I have:

import torch
import torch.nn as nn

# PositionalEncoding is assumed to be defined elsewhere; it is not shown in the question.

class TransformerReconstruct(nn.Module):
    def __init__(self, feature_size=250, num_layers=1, dropout=0.1, nhead=10, output_dim=1):
        super(TransformerReconstruct, self).__init__()
        self.model_type = 'Transformer'

        self.src_mask = None
        self.pos_encoder = PositionalEncoding(feature_size)
        self.encoder_layer = nn.TransformerEncoderLayer(d_model=feature_size, nhead=nhead, dropout=dropout)
        self.transformer_encoder = nn.TransformerEncoder(self.encoder_layer, num_layers=num_layers)
        self.decoder = nn.Linear(feature_size, output_dim)
        self.init_weights()

    def init_weights(self):
        initrange = 0.1
        self.decoder.bias.data.zero_()
        self.decoder.weight.data.uniform_(-initrange, initrange)

    def forward(self, src):
        # build (and cache) a causal mask matching the current sequence length
        if self.src_mask is None or self.src_mask.size(0) != len(src):
            device = src.device
            mask = self._generate_square_subsequent_mask(len(src)).to(device)
            self.src_mask = mask

        src = self.pos_encoder(src)
        output = self.transformer_encoder(src, self.src_mask)
        output = self.decoder(output)
        return output

    def _generate_square_subsequent_mask(self, sz):
        # additive triangular mask: -inf above the diagonal, 0 on and below it
        mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)
        mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
        return mask

And when training, I have:

model = TransformerReconstruct(feature_size=128, nhead=8, output_dim=128, num_layers=6).to(device)

This returns the right shape, but doesn't seem to learn.

My basic training loop looks like:

for i in range(0, len(data_source) - 1, input_window):
  data, target = get_batch(data_source, i, 1)
  output = model(data)

I'm using MSELoss and trying to learn a very simple identity, where the input and output are the same. However, the model is not learning. What could I be doing wrong? Thanks in advance.
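
For reference, here is a minimal sketch of what a complete training step for the identity task might look like; the optimizer, learning rate, and criterion shown here are assumptions, and get_batch is the asker's own helper:

import torch
import torch.nn as nn

criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # assumed optimizer and learning rate

model.train()
for i in range(0, len(data_source) - 1, input_window):
    data, target = get_batch(data_source, i, 1)  # asker's helper; assumed shape (seq_len, batch, features)
    optimizer.zero_grad()
    output = model(data)
    loss = criterion(output, target)  # identity task: the target is the same spectrogram as the input
    loss.backward()
    optimizer.step()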

asked Nov 11 '20 by Shamoon


2 Answers

Most of the models in Huggingface Transformers are some version of BERT and thus not autoregressive; the only exceptions are the decoder-only models (GPT and similar) and the sequence-to-sequence models.

There are two conceptually different types of masks. One is the input (padding) mask: it is specific to the input batch, and its purpose is to allow sequences of different lengths in a single batch. When the sequences are padded to the same length, the self-attention should not attend to the padding positions. This is the kind of mask you are supposed to pass when you call self.transformer_encoder in the forward method.
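
For illustration, a small sketch of how such a padding mask could be built from the true (unpadded) lengths of the sequences in a batch; lengths and max_len are assumed variables, and nn.TransformerEncoder accepts the result through its src_key_padding_mask argument:

import torch

def make_key_padding_mask(lengths, max_len):
    # shape (batch, max_len); True marks padded positions the self-attention must ignore
    positions = torch.arange(max_len).unsqueeze(0)           # (1, max_len)
    return positions >= torch.tensor(lengths).unsqueeze(1)   # broadcasts to (batch, max_len)

# usage (inside forward, with src of shape (seq_len, batch, features)):
# output = self.transformer_encoder(src, src_key_padding_mask=make_key_padding_mask(lengths, src.size(0)))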

In addition, the autoregressive Transformer decoder uses another type of mask: the triangular mask that prevents the self-attention from attending to tokens to the right of the current position (at inference time, the words to the right of the current position are unknown before they are actually generated). This is what you build in the _generate_square_subsequent_mask method, and this is what makes the model autoregressive. It is constant and does not depend on the input batch.

To summarize: to have a bidirectional Transformer, just get rid of the triangular mask. If your input sequences are of different lengths, use batch-specific padding masking; if not, simply don't pass any mask at all.
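
Applied to the code in the question, the forward method might then look roughly like this (src_key_padding_mask is optional and only needed when the batch contains padded sequences):

def forward(self, src, src_key_padding_mask=None):
    # no triangular mask: every position can attend to every other position
    src = self.pos_encoder(src)
    output = self.transformer_encoder(src, src_key_padding_mask=src_key_padding_mask)
    return self.decoder(output)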

answered Nov 03 '22 by Jindřich


If you want the model to stop behaving in an autoregressive manner, you need to 'unhide' the tokens to the right of the current token, i.e. modify or remove _generate_square_subsequent_mask.

How you modify this depends on the task. Are you trying to recover 'corrupted' input sequences? Then mask a random subset of the input tokens and train the model as a denoising autoencoder.

If you just wish to approximate the identity function, remove the mask completely.
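
For the 'corrupted input' route mentioned above, one possible sketch is to zero out a random subset of spectrogram frames before the forward pass and reconstruct the original with MSELoss; the masking ratio and the zeroing strategy are assumptions, not part of the original answer:

import torch

def corrupt_frames(spec, mask_ratio=0.15):
    # spec: (seq_len, batch, features); drop roughly 15% of timesteps by zeroing them out
    keep = torch.rand(spec.size(0), spec.size(1), 1, device=spec.device) > mask_ratio
    return spec * keep

# usage: output = model(corrupt_frames(data)); loss = criterion(output, data)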

answered Nov 03 '22 by iacob