
Using the full PyTorch Transformer Module

I tried asking this question on the PyTorch forums but didn't get any response, so I am hoping someone here can help me. Additionally, if anyone has a good example of using the transformer module, please share it, as the documentation only shows using a simple linear decoder. I'm aware that for the transformer we generally feed in the actual target sequence. Therefore, my first question concerns what comes before the transformer: I have a standard linear layer to transform my time series sequence into d_model, along with positional encodings, since according to the documentation and the transformer module code the src and tgt sequences need to have the same dimension.

    import torch
    from torch.nn.modules.transformer import Transformer

    class TransformerTimeSeries(torch.nn.Module):
        def __init__(self, n_time_series, d_model=128):
            super().__init__()
            # project each time step from n_time_series features to d_model
            self.dense_shape = torch.nn.Linear(n_time_series, d_model)
            self.pe = SimplePositionalEncoding(d_model)  # defined elsewhere
            self.transformer = Transformer(d_model, nhead=8)

So I was wondering: can I simply do something like this, or will it somehow leak information about the target? I'm still not entirely sure how loss.backward() works, so I don't know whether this will cause problems.

    def forward(self, x, t):
        x = self.dense_shape(x)
        x = self.pe(x)
        t = self.dense_shape(t)
        t = self.pe(t)
        return self.transformer(x, t)
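For reference, here is a minimal sketch of the same forward pass with a causal target mask. torch.nn.Transformer does provide generate_square_subsequent_mask, but wiring it up this way is my assumption, not code from the question:

    # Sketch only: same forward pass, plus a causal mask over the target
    # so each decoder position cannot attend to later target steps.
    # Assumes the default (seq_len, batch, d_model) layout, so t.size(0)
    # is the target sequence length.
    def forward(self, x, t):
        x = self.pe(self.dense_shape(x))
        t = self.pe(self.dense_shape(t))
        # (tgt_len, tgt_len) mask with -inf above the diagonal
        mask = self.transformer.generate_square_subsequent_mask(t.size(0))
        return self.transformer(x, t, tgt_mask=mask)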

Secondly, does the target sequence need any sort of offset? For instance, if I have the time series [0,1,2,3,4,5,6,7] and I want to feed in [0,1,2,3] to predict [4,5,6,7] (tgt), would I simply feed it in like that, or is it more complicated? Typically BERT and similar models have [CLS] and [SEP] tokens to denote the beginning and end of sentences; for time series, however, I assume I don't need a separator time step.

asked Nov 06 '19 by igodfried


1 Answer

loss.backward() traverses the computation graph of the model and accumulates the gradient of each component along the way. You can visualize this graph using an auxiliary library named PyTorchViz. Here is an example of what you can render with it:

[Image: an example computation graph rendered by PyTorchViz]
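A minimal usage sketch of PyTorchViz (the torchviz package); make_dot is its actual entry point, while the toy model here is just an illustration:

    import torch
    from torchviz import make_dot  # pip install torchviz

    model = torch.nn.Linear(8, 1)   # toy model for illustration
    y = model(torch.randn(2, 8))
    # Render the graph that loss.backward() would traverse
    make_dot(y, params=dict(model.named_parameters())).render("graph", format="png")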

Whether you visualize the graph or not: it looks like you are using the same dense layer for both the target and the input. Because that layer's weights receive gradients from every path they appear in, the backward pass will compute gradients through the target encoding in addition to the input encoding, which will indeed cause the model to learn based on the target sequence.
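A toy sketch of that effect (my own example, not the question's model): one Linear applied to two inputs accumulates gradients from both paths in a single backward() call.

    import torch

    dense = torch.nn.Linear(4, 4)
    x = torch.randn(3, 4)  # stand-in for the source sequence
    t = torch.randn(3, 4)  # stand-in for the target sequence

    loss = dense(x).sum() + dense(t).sum()
    loss.backward()
    # dense.weight.grad now holds contributions from both x and t
    print(dense.weight.grad)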

As for your second question: I think feeding the model [0,1,2,3] in order to predict [4,5,6,7] will work fine, depending on what data you are using. If you are using periodic signals (e.g. ECG time series, sin(x), etc.) I think it will do a great job as is, with no further complications needed.
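A minimal slicing sketch, assuming the common teacher-forcing convention where the decoder input is the target shifted right by one step; the exact offsets are my assumption, not part of this answer:

    import torch

    series = torch.arange(8, dtype=torch.float32)  # [0,1,2,3,4,5,6,7]
    src = series[:4]         # encoder input:                [0,1,2,3]
    tgt_input = series[3:7]  # decoder input (shifted right): [3,4,5,6]
    expected = series[4:]    # values the decoder predicts:  [4,5,6,7]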

However, if you want to predict certain events, like the end of a sentence or a price at a specific point (say, the end of the trading day), then you will need to add tokens to create a robust model (not to say that it will fail without them, but they can definitely help prediction accuracy).

answered Nov 17 '22 by Dr. Prof. Patrick