I'm new to ML and I'm trying to build an encoder-decoder model that generates Emmet code from screenshots. I have made a dataset consisting of screenshots and their corresponding Emmet code (a kind of abbreviation syntax for HTML). I use a Swin Transformer to extract image features, which gives me an encoder input of shape (32, 512), i.e. (batch_size, sequence_length). But I've learned that a transformer encoder expects an input of shape (batch_size, sequence_length, embedding_dim). Did I do something wrong in the feature-extraction step, or is it possible to modify the transformer encoder to accept my input? Please help me understand this, thank you very much! My code looks like this:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset
from build_dataset import EmmetDataset
from swin_transformer_pytorch import SwinTransformer
from transformer_encoder import TransformerEncoder
STModel = SwinTransformer(
    hidden_dim=96,
    layers=(2, 2, 6, 2),
    heads=(3, 6, 12, 24),
    channels=3,
    num_classes=512,
    head_dim=32,
    window_size=4,
    downscaling_factors=(4, 2, 2, 2),
    relative_pos_embedding=True
)
encoder = TransformerEncoder(d_model=512, num_heads=8, num_layers=6)
train_dataset = EmmetDataset('train')
val_dataset = EmmetDataset('val')
test_dataset = EmmetDataset('test')
train_dataloader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=32, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=32, shuffle=False)
num_epochs = 10
for epoch in range(num_epochs):
    for i, (screenshot_tensor, serialized_code_tensor) in enumerate(train_dataloader):
        print(screenshot_tensor.shape)       # [32, 3, 768, 768]
        print(serialized_code_tensor.shape)  # [32, 512]
        # Swin Transformer to extract image features
        features = STModel(screenshot_tensor)
        print(features.shape)                # [32, 512]
        # Encoder-Decoder
        encoder_output = encoder(features)   # the encoder expects an input of (batch_size, sequence_length, embedding_dim), but I only have (batch_size, sequence_length)
        print(encoder_output)
        # ... ...
And my model looks like this: [model architecture diagram]
This is not really an answer, just a lengthy comment/question...
I am not quite sure what exactly you want to achieve here. Just going by the variable names, the SwinTransformer is set up as an image classifier, which is not a problem by itself; you can certainly use it as a feature extractor too.
But why feed its output into a TransformerEncoder? Encoders are usually applied to set- or sequence-structured data (and in the sequential case you also need positional encoding), whereas your Swin model only gives you a single feature vector (class logits).
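For reference, this is the shape contract a transformer encoder typically expects. Here is a minimal sketch using PyTorch's built-in nn.TransformerEncoder; I'm assuming your TransformerEncoder class behaves similarly, which it may not:

import torch
import torch.nn as nn

# An encoder consumes a *sequence* of embeddings:
# (batch_size, sequence_length, d_model) when batch_first=True.
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
toy_encoder = nn.TransformerEncoder(layer, num_layers=6)

tokens = torch.randn(32, 49, 512)  # e.g. 49 patch tokens per image (made-up number)
out = toy_encoder(tokens)          # -> (32, 49, 512): one output vector per token
print(out.shape)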
If you are completely sure that what you are doing is sensible, then you can just reshape your feature vector to fit the expected input shape of the TransformerEncoder with features = features.unsqueeze(1), giving shape (32, 1, 512) (note that features is a torch.Tensor, so use torch operations rather than np.reshape). I think that should "run", but I doubt it will do what you might expect it to do.
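In code, that workaround might look like this; a sketch assuming your encoder accepts batch-first input, and note the resulting "sequence" has length 1, so self-attention has nothing to attend over:

features = STModel(screenshot_tensor)  # (32, 512) pooled feature vector
features = features.unsqueeze(1)       # (32, 1, 512): a "sequence" of one token
encoder_output = encoder(features)     # runs, but attention over a single token is a no-op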