
How to make transformer encoder and decoder model accept input size of (batch_size, sequence_length)?

I'm new to ML and I'm trying to build an encoder-decoder model that generates Emmet code from a screenshot. I have built a dataset of screenshots and their corresponding Emmet code (a kind of abbreviation syntax for HTML). I use a Swin Transformer to extract features from each image, which gives me an encoder input of shape (32, 512), i.e. (batch_size, sequence_length). But I've learned that the transformer encoder expects an input of shape (batch_size, sequence_length, embedding_dim). Did I do something wrong in the feature-extraction step, or is it possible to modify the transformer encoder to accept my input? Please help me understand this, thank you very much! My code looks like this:

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset
from build_dataset import EmmetDataset 
from swin_transformer_pytorch import SwinTransformer
from transformer_encoder import TransformerEncoder


STModel = SwinTransformer(
    hidden_dim=96,
    layers=(2, 2, 6, 2),
    heads=(3, 6, 12, 24),
    channels=3,
    num_classes=512,
    head_dim=32,
    window_size=4,
    downscaling_factors=(4, 2, 2, 2),
    relative_pos_embedding=True
)
encoder = TransformerEncoder(d_model=512, num_heads=8, num_layers=6)

train_dataset = EmmetDataset('train')
val_dataset = EmmetDataset('val')
test_dataset = EmmetDataset('test')

train_dataloader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=32, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=32, shuffle=False)
num_epochs = 10

for epoch in range(num_epochs):
    for i, (screenshot_tensor, serialized_code_tensor) in enumerate(train_dataloader):
        print(screenshot_tensor.shape) # [32, 3, 768, 768]
        print(serialized_code_tensor.shape) # [32, 512]
        # swinTransformer to extract features
        features = STModel(screenshot_tensor)
        print(features.shape) # [32, 512]
        # Encoder-Decoder
        # The encoder expects (batch_size, sequence_length, embedding_dim),
        # but features only has shape (batch_size, sequence_length)
        encoder_output = encoder(features)
        print(encoder_output)
        # ... ...

And my model looks like this: [image: my model]

asked Dec 09 '25 08:12 by GJ1214


1 Answer

This is not really an answer, just a lengthy comment/question...

I am not quite sure what exactly you want to achieve here. Judging by the variable names (e.g. num_classes=512), the SwinTransformer is set up as an image classifier. That is not a problem in itself; you may be able to use it as a feature extractor too.

But why feed its output into a TransformerEncoder? Those are usually used for set or sequence data (for sequences you also need positional encoding), but your Swin model gives you only a single feature vector per image (effectively class scores).
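To make the "sequence" point concrete, here is a minimal sketch of how a spatial feature map could be turned into a token sequence for an encoder. It does not use your Swin implementation: the random tensor is a stand-in, its shape (768 channels on a 24x24 grid) is my assumption from hidden_dim=96 and a total downscaling factor of 32, the projection layer is hypothetical, and torch's built-in nn.TransformerEncoder stands in for your own TransformerEncoder class. Positional encoding is left out for brevity.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for a backbone feature map taken BEFORE global pooling.
# Shape is an assumption: (batch, channels, height, width).
batch_size, channels, height, width = 2, 768, 24, 24
feature_map = torch.randn(batch_size, channels, height, width)

# Treat every spatial location as one token:
# (B, C, H, W) -> (B, C, H*W) -> (B, H*W, C)
tokens = feature_map.flatten(2).transpose(1, 2)

# Hypothetical projection from the backbone width to the encoder width.
d_model = 512
project = nn.Linear(channels, d_model)
tokens = project(tokens)  # (B, H*W, d_model)

# torch's built-in encoder, used here only as a stand-in.
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=1)
encoder.eval()

with torch.no_grad():
    memory = encoder(tokens)

print(tokens.shape, memory.shape)
```

With this layout the encoder sees 576 tokens per image instead of one pooled vector, so self-attention has something to attend over.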

If you are completely sure that what you are doing is sensible, you can reshape your feature vector to the shape the TransformerEncoder expects with features = features.unsqueeze(1), which gives (32, 1, 512). I think that should "run", but I doubt it will do what you expect it to do.
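A minimal sketch of that reshape, assuming features is a torch tensor of shape (32, 512). The random tensor stands in for the Swin output, and torch's built-in nn.TransformerEncoder (with batch_first=True) stands in for your own TransformerEncoder class, whose interface I can't see.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for the output of STModel(screenshot_tensor).
features = torch.randn(32, 512)

# Insert a sequence axis of length 1: (32, 512) -> (32, 1, 512).
sequence = features.unsqueeze(1)

# torch's built-in encoder, used here only as a stand-in.
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
encoder.eval()

with torch.no_grad():
    encoder_output = encoder(sequence)

print(sequence.shape, encoder_output.shape)
```

This runs, but with a sequence length of 1 each token can only attend to itself, so the attention layers have nothing to mix and the encoder adds little beyond its feed-forward layers.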

answered Dec 12 '25 09:12 by Phoenixdust


