I'm working with the MNIST dataset from a Kaggle challenge and am having trouble preprocessing the data. I also don't know what the best practices are and was wondering if you could advise me on that.
Disclaimer: I can't just use torchvision.datasets.mnist because I need to use Kaggle's data for training and submission.
In this tutorial, it was advised to create a Dataset object that loads .pt tensors from files, to fully utilize the GPU. To achieve that, I needed to load the CSV data provided by Kaggle and save it as .pt files:
import pandas as pd
import torch
import numpy as np

# import data
digits_train = pd.read_csv('data/train.csv')
train_tensor = torch.tensor(digits_train.drop('label', axis=1).to_numpy(), dtype=torch.int)
labels_tensor = torch.tensor(digits_train['label'].to_numpy())

# save each sample as its own .pt file
for i in range(train_tensor.shape[0]):
    torch.save(train_tensor[i], "data/train-" + str(i) + ".pt")
Each train_tensor[i].shape is torch.Size([1, 784]).
However, each such .pt file is about 130 MB, while a tensor of the same size, filled with randomly generated integers, is only about 6.6 kB. Why are these tensors so huge, and how can I reduce their size?
The dataset has 42,000 samples. Should I even bother with batching this data? Should I bother with saving tensors to separate files, rather than loading them all into RAM and then slicing them into batches? What is the optimal approach here?
As explained in this discussion, torch.save() serializes the whole underlying storage, not just the slice. You need to explicitly copy the data using clone().
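A minimal sketch of the fix, assuming the train_tensor built in the question above: cloning each slice before saving detaches it from the full 42,000 × 784 storage, so each file contains only one row's data.

import torch

# Assumes train_tensor is the [42000, 784] tensor created from the Kaggle CSV.
# clone() copies only the row into a fresh, contiguous storage, so torch.save
# serializes ~784 values instead of the whole dataset's storage.
for i in range(train_tensor.shape[0]):
    torch.save(train_tensor[i].clone(), "data/train-" + str(i) + ".pt")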
Don't worry, at runtime the data is only allocated once unless you explicitly create copies.
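A quick way to see this, as a sketch with a made-up tensor of the same shape:

import torch

data = torch.randint(0, 256, (42000, 784), dtype=torch.int32)
row = data[0]
# The slice is a view into the same memory; no copy is made.
print(row.data_ptr() == data.data_ptr())          # True
# clone() is what actually allocates a separate, row-sized buffer.
print(row.clone().data_ptr() == data.data_ptr())  # False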
As a piece of general advice: if the data easily fits into your memory, just load it all at once. For MNIST at around 130 MB, that's certainly the case.
However, I would still batch the data, because training on mini-batches converges faster. Look up the advantages of SGD for more details.
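For example, here is a rough sketch of the in-memory approach (paths, dtypes, and batch size are just placeholder choices): load the CSV once, keep everything as tensors in RAM, and let a DataLoader handle shuffled mini-batches.

import pandas as pd
import torch
from torch.utils.data import TensorDataset, DataLoader

digits_train = pd.read_csv('data/train.csv')
labels = torch.tensor(digits_train['label'].to_numpy(), dtype=torch.long)
images = torch.tensor(digits_train.drop('label', axis=1).to_numpy(),
                      dtype=torch.float32) / 255.0  # scale pixels to [0, 1]

dataset = TensorDataset(images, labels)
loader = DataLoader(dataset, batch_size=64, shuffle=True)

for batch_images, batch_labels in loader:
    pass  # run your training step on each mini-batch here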