PyTorch torch.save() produces huge files for small tensors from MNIST

I'm working with the MNIST dataset from a Kaggle challenge and am having trouble preprocessing the data. I also don't know what the best practices are and was wondering if you could advise me on that.

Disclaimer: I can't just use torchvision.datasets.mnist because I need to use Kaggle's data for training and submission.

In this tutorial, it was advised to create a Dataset object loading .pt tensors from files, to fully utilize GPU. In order to achieve that, I needed to load the csv data provided by Kaggle and save it as .pt files:

import pandas as pd
import torch
import numpy as np

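# Load the Kaggle training data (one "label" column plus 784 pixel columns)
digits_train = pd.read_csv('data/train.csv')

# Pixel values as an int32 tensor, labels as a separate tensor
train_tensor = torch.tensor(digits_train.drop('label', axis=1).to_numpy(), dtype=torch.int)
labels_tensor = torch.tensor(digits_train['label'].to_numpy())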

for i in range(train_tensor.shape[0]):
    torch.save(train_tensor[i], "data/train-" + str(i) + ".pt")

Each train_tensor[i].shape is torch.Size([784])

However, each such .pt file is about 130 MB. A tensor of the same shape filled with randomly generated integers takes only 6.6 kB. Why are these files so huge, and how can I reduce their size?

The dataset has 42,000 samples. Should I even bother with batching this data? Should I bother with saving tensors to separate files, rather than loading them all into RAM and then slicing into batches? What is the best approach here?

matwasilewski asked Feb 26 '20 19:02
1 Answer

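As explained in this discussion, torch.save() serializes the entire underlying storage that a tensor view points to, not just the slice. Because train_tensor[i] is a view into the full 42,000 x 784 int32 tensor, every file contains the whole dataset (42,000 x 784 x 4 bytes, about 130 MB). To save only the row, copy it explicitly with clone() before saving, for example torch.save(train_tensor[i].clone(), ...).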

Don't worry, at runtime the data is only allocated once unless you explicitly create copies.
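The effect is easy to reproduce on a small stand-in tensor. This is only a minimal sketch, not the asker's data: the helper saved_size and the 1000 x 784 zero tensor are assumptions made for illustration, and it serializes into an in-memory buffer instead of writing .pt files.

```python
import io

import torch


def saved_size(t: torch.Tensor) -> int:
    """Return the number of bytes torch.save() writes for tensor t (hypothetical helper)."""
    buffer = io.BytesIO()
    torch.save(t, buffer)
    return buffer.getbuffer().nbytes


# Stand-in for the MNIST pixel matrix (assumed size, smaller than 42,000 rows).
data = torch.zeros((1000, 784), dtype=torch.int)

row_view = data[0]          # a view: shares storage with all of `data`
row_copy = data[0].clone()  # a copy: owns only its own 784 elements

view_bytes = saved_size(row_view)
copy_bytes = saved_size(row_copy)

# The view drags the whole 1000 x 784 storage along; the clone does not.
print("view:", view_bytes, "bytes")
print("clone:", copy_bytes, "bytes")
```

In the loop from the question, the same idea would be torch.save(train_tensor[i].clone(), "data/train-" + str(i) + ".pt").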

As general advice: if the data easily fits into memory, just load it all at once. For MNIST at about 130 MB, that is certainly the case.

However, I would still train on mini-batches, because training typically converges faster that way. Look up the advantages of SGD (stochastic gradient descent) for more details.
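One way to combine both suggestions, keeping everything in memory while still iterating in mini-batches, might look like the sketch below. It uses TensorDataset and DataLoader from torch.utils.data; the random tensors are a synthetic stand-in for the Kaggle CSV, and the sample count (1000) and batch size (64) are arbitrary assumptions.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for the Kaggle data (assumed): 1000 samples, 784 pixels each.
pixels = torch.randint(0, 256, (1000, 784))
labels = torch.randint(0, 10, (1000,))

# Keep the whole dataset in memory; scale pixel values to floats in [0, 1].
dataset = TensorDataset(pixels.float() / 255.0, labels)

# The DataLoader slices the in-memory tensors into shuffled mini-batches.
loader = DataLoader(dataset, batch_size=64, shuffle=True)

batch_sizes = []
for x, y in loader:
    # A training step (forward pass, loss, backward pass) would go here.
    batch_sizes.append(x.shape[0])

print(len(batch_sizes), "batches")
```

With the real data, pixels and labels would be replaced by the tensors built from train.csv, so no per-sample .pt files are needed at all.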

The Floe answered Oct 23 '22 14:10