I'm new to ML and I'm trying to build an encoder-decoder model that generates Emmet code from screenshots. I have made a dataset consisting of screenshots and their corresponding Emmet code (a kind of abbreviation syntax for HTML). I use a Swin Transformer to extract image features, which gives me an encoder input of shape (32, 512), i.e. (batch_size, sequence_length). But I've learned that a transformer encoder expects an input of shape (batch_size, sequence_length, embedding_dim). Did I do something wrong in the feature-extraction step, or is it possible to modify the transformer encoder to accept my input? Please help me understand this, thank you very much! My code looks like this:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset
from build_dataset import EmmetDataset
from swin_transformer_pytorch import SwinTransformer
from transformer_encoder import TransformerEncoder
STModel = SwinTransformer(
    hidden_dim=96,
    layers=(2, 2, 6, 2),
    heads=(3, 6, 12, 24),
    channels=3,
    num_classes=512,
    head_dim=32,
    window_size=4,
    downscaling_factors=(4, 2, 2, 2),
    relative_pos_embedding=True
)
encoder = TransformerEncoder(d_model=512, num_heads=8, num_layers=6)
train_dataset = EmmetDataset('train')
val_dataset = EmmetDataset('val')
test_dataset = EmmetDataset('test')
train_dataloader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=32, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=32, shuffle=False)
num_epochs = 10
for epoch in range(num_epochs):
    for i, (screenshot_tensor, serialized_code_tensor) in enumerate(train_dataloader):
        print(screenshot_tensor.shape)       # [32, 3, 768, 768]
        print(serialized_code_tensor.shape)  # [32, 512]
        # Swin Transformer to extract image features
        features = STModel(screenshot_tensor)
        print(features.shape)                # [32, 512]
        # Encoder-Decoder
        encoder_output = encoder(features)   # the encoder expects an input of (batch_size, sequence_length, embedding_dim), but I only have (batch_size, sequence_length)
        print(encoder_output)
        # ... ...
And my model looks like this: [model architecture diagram]
This is not really an answer, just a lengthy comment/question...
I am not quite sure what exactly you want to achieve here. Just going by the variable names, the SwinTransformer is set up as an image classifier, which is not a problem by itself; you can certainly use it as a feature extractor too.
But why feed its output into a TransformerEncoder? Encoders are usually applied to set- or sequence-structured data (and in the sequential case you also need positional encoding), whereas your Swin model only gives you a single feature vector (class logits).
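For reference, this is the shape contract a transformer encoder typically expects. Here is a minimal sketch using PyTorch's built-in nn.TransformerEncoder; I'm assuming your TransformerEncoder class behaves similarly, which it may not:

import torch
import torch.nn as nn

# An encoder consumes a *sequence* of embeddings:
# (batch_size, sequence_length, d_model) when batch_first=True.
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
toy_encoder = nn.TransformerEncoder(layer, num_layers=6)

tokens = torch.randn(32, 49, 512)  # e.g. 49 patch tokens per image (made-up number)
out = toy_encoder(tokens)          # -> (32, 49, 512): one output vector per token
print(out.shape)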
If you are completely sure that what you are doing is sensible, then you can just reshape your feature vector to fit the expected input shape of the TransformerEncoder with features = features.unsqueeze(1), giving shape (32, 1, 512) (note that features is a torch.Tensor, so use torch operations rather than np.reshape). I think that should "run", but I doubt it will do what you might expect it to do.
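In code, that workaround might look like this; a sketch assuming your encoder accepts batch-first input, and note the resulting "sequence" has length 1, so self-attention has nothing to attend over:

features = STModel(screenshot_tensor)  # (32, 512) pooled feature vector
features = features.unsqueeze(1)       # (32, 1, 512): a "sequence" of one token
encoder_output = encoder(features)     # runs, but attention over a single token is a no-op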