 

How to use the PyTorch Transformer with multi-dimensional sequence-to-sequence?

I'm trying to do seq2seq with a Transformer model. My input and output have the same shape (torch.Size([499, 128]), where 499 is the sequence length and 128 is the number of features).

My input looks like: [image: plot of the input signal]

My output looks like: [image: plot of the target output signal]

My training loop is:

    for batch in tqdm(dataset):
        optimizer.zero_grad()
        x, y = batch

        x = x.to(DEVICE)
        y = y.to(DEVICE)

        # the decoder input (tgt) is all zeros, with the same shape as the source
        pred = model(x, torch.zeros(x.size()).to(DEVICE))

        loss = loss_fn(pred, y)
        loss.backward()
        optimizer.step()

My model is:

    import math
    import torch
    import torch.nn as nn

    # PositionalEncoding (as in the PyTorch tutorials) is defined elsewhere.

    class Reconstructor(nn.Module):
        def __init__(self, input_dim, output_dim, dim_embedding, num_layers=4, nhead=8, dim_feedforward=2048, dropout=0.5):
            super(Reconstructor, self).__init__()

            self.model_type = 'Transformer'
            self.src_mask = None
            self.pos_encoder = PositionalEncoding(d_model=dim_embedding, dropout=dropout)
            self.transformer = nn.Transformer(d_model=dim_embedding, nhead=nhead, dim_feedforward=dim_feedforward, num_encoder_layers=num_layers, num_decoder_layers=num_layers)
            self.decoder = nn.Linear(dim_embedding, output_dim)
            self.decoder_act_fn = nn.PReLU()

            self.init_weights()

        def init_weights(self):
            initrange = 0.1
            nn.init.zeros_(self.decoder.weight)
            nn.init.uniform_(self.decoder.weight, -initrange, initrange)

        def forward(self, src, tgt):
            pe_src = self.pos_encoder(src.permute(1, 0, 2))  # (seq, batch, features)
            transformer_output = self.transformer_encoder(pe_src)
            decoder_output = self.decoder(transformer_output.permute(1, 0, 2)).squeeze(2)
            decoder_output = self.decoder_act_fn(decoder_output)
            return decoder_output

My output has a shape of torch.Size([32, 499, 128]), where 32 is the batch size, 499 is the sequence length, and 128 is the number of features. But every position in the output has the same values:

    tensor([[[0.0014, 0.0016, 0.0017,  ..., 0.0018, 0.0021, 0.0017],
             [0.0014, 0.0016, 0.0017,  ..., 0.0018, 0.0021, 0.0017],
             [0.0014, 0.0016, 0.0017,  ..., 0.0018, 0.0021, 0.0017],
             ...,
             [0.0014, 0.0016, 0.0017,  ..., 0.0018, 0.0021, 0.0017],
             [0.0014, 0.0016, 0.0017,  ..., 0.0018, 0.0021, 0.0017],
             [0.0014, 0.0016, 0.0017,  ..., 0.0018, 0.0021, 0.0017]]],
           grad_fn=<PreluBackward>)

What am I doing wrong? Thank you so much for any help.

asked Nov 17 '20 by Shamoon

People also ask

What is PyTorch transformer?

PyTorch-Transformers (formerly known as pytorch-pretrained-bert ) is a library of state-of-the-art pre-trained models for Natural Language Processing (NLP).

Is the Transformer a sequence-to-sequence model?

The transformer neural network is a novel architecture that aims to solve sequence-to-sequence tasks while handling long-range dependencies with ease. It was first proposed in the paper “Attention Is All You Need” and is now a state-of-the-art technique in the field of NLP.

What is Src_key_padding_mask?

src_key_padding_mask or key_padding_mask is a matrix that marks the padding positions the layer should not attend to. It has shape (batch_size, N), where N is the source sequence length.
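
For illustration, a minimal sketch of building such a mask (the batch and lengths here are made up): a boolean tensor of shape (batch_size, N) in which True marks a padded position.

    import torch

    # Hypothetical batch of 3 sequences padded to a common length N = 5.
    lengths = torch.tensor([5, 3, 2])             # real length of each sequence
    positions = torch.arange(5).unsqueeze(0)      # shape (1, N)

    # True marks padding that attention should ignore; shape (batch_size, N).
    src_key_padding_mask = positions >= lengths.unsqueeze(1)
    # tensor([[False, False, False, False, False],
    #         [False, False, False,  True,  True],
    #         [False, False,  True,  True,  True]])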


1 Answer

There are several points to check. Since you get the same output for different inputs, I suspect that some layer zeros out all of its inputs. So check the outputs of the PositionalEncoding and of the Transformer's encoder block to make sure they are not constant. But before that, make sure your inputs actually differ (try injecting noise, for example).
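
As a rough sketch of that check (assuming the Reconstructor instance from the question is available as model, and x is one input batch of shape (batch, seq, features)), you can run the intermediate stages directly and look at their spread:

    import torch

    with torch.no_grad():
        pe_out = model.pos_encoder(x.permute(1, 0, 2))   # after positional encoding
        enc_out = model.transformer.encoder(pe_out)      # after the encoder stack

    # A standard deviation near zero means that stage collapsed to a constant.
    for name, t in [("input", x), ("pos_encoder", pe_out), ("encoder", enc_out)]:
        print(f"{name}: mean={t.mean().item():.4f} std={t.std().item():.4f}")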

Additionally, from what I see in the pictures, your input and output are speech signals, sampled at 22.05 kHz (I guess), so each example should have ~10k features, but you claim to have only 128. This is another place to check. Now, the number 499 represents some time slices. Make sure your slices are of a reasonable length (20-50 ms, usually 30 ms). If that is the case, then 30 ms times 500 is 15 seconds, which is much more than you have in your example. And finally, you are masking off a third of a second of speech in your input, which I believe is too much.
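
A back-of-the-envelope check of that framing arithmetic (the frame length is the guess above, not a value from the question):

    n_frames = 499             # the question's sequence length
    frame_ms = 30              # a typical analysis frame, 20-50 ms

    duration_s = n_frames * frame_ms / 1000
    print(duration_s)          # 14.97 -> ~15 s of audio implied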

I think it would be useful to examine the Wav2vec and Wav2vec 2.0 papers, which tackle the problem of self-supervised training in the speech recognition domain using a Transformer encoder with great success.

answered Oct 19 '22 by igrinis