I'm trying to make a simple image classifier using PyTorch. This is how I load the data into a dataset and DataLoader:
batch_size = 64
validation_split = 0.2
data_dir = PROJECT_PATH + "/categorized_products"
transform = transforms.Compose([transforms.Grayscale(), CustomToTensor()])
dataset = ImageFolder(data_dir, transform=transform)

indices = list(range(len(dataset)))
split = int(len(indices) * (1 - validation_split))
train_indices = indices[:split]
test_indices = indices[split:]

train_sampler = SubsetRandomSampler(train_indices)
test_sampler = SubsetRandomSampler(test_indices)

train_loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, sampler=train_sampler, num_workers=16)
test_loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, sampler=test_sampler, num_workers=16)
I want to print out the number of images in each class in training and test data separately, something like this:
In train data:
In test data:
I tried this:
from collections import Counter
print(dict(Counter(sample_tup[1] for sample_tup in dataset.imgs)))
but I got this error:
AttributeError: 'MyDataset' object has no attribute 'img'
You need to use .targets to access the labels of the data, i.e.
print(dict(Counter(dataset.targets)))
It'll print something like this (e.g. for the MNIST dataset):
{5: 5421, 0: 5923, 4: 5842, 1: 6742, 9: 5949, 2: 5958, 3: 6131, 6: 5918, 7: 6265, 8: 5851}
Also, you can use .classes or .class_to_idx to get the mapping of label ids to class names:
print(dataset.class_to_idx)
{'0 - zero': 0,
'1 - one': 1,
'2 - two': 2,
'3 - three': 3,
'4 - four': 4,
'5 - five': 5,
'6 - six': 6,
'7 - seven': 7,
'8 - eight': 8,
'9 - nine': 9}
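If you want the counts keyed by class name rather than label id, you can combine .targets with .classes. A minimal sketch, using a tiny hand-built stand-in for an ImageFolder-style dataset so it runs anywhere (with a real ImageFolder, dataset.targets and dataset.classes exist already; the names and labels here are illustrative):

```python
from collections import Counter

# Stand-in for an ImageFolder-style dataset: only the attributes we need.
class FakeDataset:
    def __init__(self, targets, classes):
        self.targets = targets    # integer label per sample
        self.classes = classes    # index -> class name

dataset = FakeDataset(targets=[0, 1, 1, 2, 2, 2], classes=["cat", "dog", "fish"])

# Count label ids, then translate them to class names for readability.
id_counts = Counter(dataset.targets)
name_counts = {dataset.classes[label]: n for label, n in id_counts.items()}
print(name_counts)  # {'cat': 1, 'dog': 2, 'fish': 3}
```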
Edit: Method 1
From the comments, in order to get the class distributions of the training and test sets separately, you can simply iterate over each subset as below:
train_size = int(0.8 * len(dataset))
test_size = len(dataset) - train_size
train_dataset, test_dataset = torch.utils.data.random_split(dataset, [train_size, test_size])
# labels in training set
train_classes = [label for _, label in train_dataset]
Counter(train_classes)
Counter({0: 4757,
1: 5363,
2: 4782,
3: 4874,
4: 4678,
5: 4321,
6: 4747,
7: 5024,
8: 4684,
9: 4770})
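The same loop gives the test-set distribution. A self-contained sketch, using a toy list of (sample, label) tuples in place of your dataset and a stdlib shuffle in place of torch.utils.data.random_split, so it runs without PyTorch (the counts are illustrative, not from your data):

```python
import random
from collections import Counter

# Toy (sample, label) pairs standing in for ImageFolder output.
data = [(i, i % 3) for i in range(100)]   # labels cycle through 0, 1, 2

# Mimic torch.utils.data.random_split with a stdlib shuffle.
indices = list(range(len(data)))
random.Random(0).shuffle(indices)
train_size = int(0.8 * len(data))
train_dataset = [data[i] for i in indices[:train_size]]
test_dataset = [data[i] for i in indices[train_size:]]

# Same counting loop as above, once per subset.
train_counts = Counter(label for _, label in train_dataset)
test_counts = Counter(label for _, label in test_dataset)
print("In train data:", dict(train_counts))
print("In test data:", dict(test_counts))
```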
Edit (2): Method 2
Since you have a large dataset, and as you said it takes considerable time to iterate over the whole training set, there is another way: you can use the .indices attribute of the subset, which refers to the indices in the original dataset that were selected for that subset, i.e.
train_classes = [dataset.targets[i] for i in train_dataset.indices]
Counter(train_classes)  # if this doesn't work: Counter(i.item() for i in train_classes)
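Applied to your sampler-based setup, the same idea works without random_split at all, since you already hold train_indices and test_indices and can index dataset.targets directly. A minimal stdlib sketch, with a small hand-built targets list standing in for dataset.targets (no real image folder needed):

```python
from collections import Counter

# Stand-in for dataset.targets (one integer label per sample, as ImageFolder provides).
targets = [0] * 10 + [1] * 15 + [2] * 5

# With the sampler approach from the question, the index lists are already at hand.
indices = list(range(len(targets)))
split = int(len(indices) * 0.8)
train_indices, test_indices = indices[:split], indices[split:]

train_counts = Counter(targets[i] for i in train_indices)
test_counts = Counter(targets[i] for i in test_indices)
print("In train data:", dict(train_counts))  # {0: 10, 1: 14}
print("In test data:", dict(test_counts))    # {1: 1, 2: 5}
```

Note that this counts without touching any images, so it is fast even for a large dataset; it only reads the label list.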