I'm working with the MNIST dataset from a Kaggle challenge and am having trouble preprocessing the data. I also don't know what the best practices are and was wondering if you could advise me on that.
Disclaimer: I can't just use torchvision.datasets.mnist because I need to use Kaggle's data for training and submission.
In this tutorial, it was advised to create a Dataset object that loads .pt tensors from files, to fully utilize the GPU. To achieve that, I needed to load the CSV data provided by Kaggle and save it as .pt files:
import pandas as pd
import torch
import numpy as np

# import data
digits_train = pd.read_csv('data/train.csv')
train_tensor = torch.tensor(digits_train.drop('label', axis=1).to_numpy(), dtype=torch.int)
labels_tensor = torch.tensor(digits_train['label'].to_numpy())

# save each sample as its own .pt file
for i in range(train_tensor.shape[0]):
    torch.save(train_tensor[i], "data/train-" + str(i) + ".pt")
Each train_tensor[i].shape is torch.Size([1, 784]).
However, each such .pt file is about 130 MB, while a tensor of the same size, filled with randomly generated integers, is only about 6.6 kB. Why are these tensors so huge, and how can I reduce their size?
The dataset has 42,000 samples. Should I even bother with batching this data? Should I bother with saving tensors to separate files, rather than loading them all into RAM and then slicing them into batches? What is the optimal approach here?
As explained in this discussion, torch.save() serializes the whole underlying storage, not just the slice. You need to explicitly copy the data using clone().
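A minimal sketch of the fix, assuming the train_tensor built in the question above: cloning each slice before saving detaches it from the full 42,000 × 784 storage, so each file contains only one row's data.

import torch

# Assumes train_tensor is the [42000, 784] tensor created from the Kaggle CSV.
# clone() copies only the row into a fresh, contiguous storage, so torch.save
# serializes ~784 values instead of the whole dataset's storage.
for i in range(train_tensor.shape[0]):
    torch.save(train_tensor[i].clone(), "data/train-" + str(i) + ".pt")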
Don't worry, at runtime the data is only allocated once unless you explicitly create copies.
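A quick way to see this, as a sketch with a made-up tensor of the same shape:

import torch

data = torch.randint(0, 256, (42000, 784), dtype=torch.int32)
row = data[0]
# The slice is a view into the same memory; no copy is made.
print(row.data_ptr() == data.data_ptr())          # True
# clone() is what actually allocates a separate, row-sized buffer.
print(row.clone().data_ptr() == data.data_ptr())  # False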
As a piece of general advice: if the data easily fits into your memory, just load it all at once. For MNIST at around 130 MB, that's certainly the case.
However, I would still batch the data, because training on mini-batches converges faster. Look up the advantages of SGD for more details.
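For example, here is a rough sketch of the in-memory approach (paths, dtypes, and batch size are just placeholder choices): load the CSV once, keep everything as tensors in RAM, and let a DataLoader handle shuffled mini-batches.

import pandas as pd
import torch
from torch.utils.data import TensorDataset, DataLoader

digits_train = pd.read_csv('data/train.csv')
labels = torch.tensor(digits_train['label'].to_numpy(), dtype=torch.long)
images = torch.tensor(digits_train.drop('label', axis=1).to_numpy(),
                      dtype=torch.float32) / 255.0  # scale pixels to [0, 1]

dataset = TensorDataset(images, labels)
loader = DataLoader(dataset, batch_size=64, shuffle=True)

for batch_images, batch_labels in loader:
    pass  # run your training step on each mini-batch here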