
PyTorch's dataloader "too many open files" error when no files should be open

So this is a minimal example which illustrates the issue:

This is the Dataset:

import numpy as np
from torch.utils.data import Dataset

class IceShipDataset(Dataset):
    BAND1='band_1'
    BAND2='band_2'
    IMAGE='image'

    @staticmethod
    def get_band_img(sample,band):
        # reshape the flat list of band values into a square 75x75 image
        pic_size=75
        img=np.array(sample[band])
        img.resize(pic_size,pic_size)
        return img

    def __init__(self,data,transform=None):
        self.data=data
        self.transform=transform

    def __len__(self):
        return len(self.data)  

    def __getitem__(self, idx):

        sample=self.data[idx]
        band1_img=IceShipDataset.get_band_img(sample,self.BAND1)
        band2_img=IceShipDataset.get_band_img(sample,self.BAND2)
        img=np.stack([band1_img,band2_img],2)
        sample[self.IMAGE]=img
        if self.transform is not None:
                sample=self.transform(sample)
        return sample

And this is the code which fails:

import json
import torch

PLAY_BATCH_SIZE=4
#load data. There are 1604 examples.
with open('train.json','r') as f:
        data=f.read()
data=json.loads(data)

ds=IceShipDataset(data)
playloader = torch.utils.data.DataLoader(ds, batch_size=PLAY_BATCH_SIZE,
                                          shuffle=False, num_workers=4)
for i,data in enumerate(playloader):
        print(i)

It fails with that weird "too many open files" error inside the for loop. My torch version is 0.3.0.post4.

If you want the json file, it is available at Kaggle (https://www.kaggle.com/c/statoil-iceberg-classifier-challenge)

I should mention that the error has nothing to do with my laptop's open-file limits, which are nowhere near exhausted:

yoni@yoni-Lenovo-Z710:~$ lsof | wc -l
89114
yoni@yoni-Lenovo-Z710:~$ cat /proc/sys/fs/file-max
791958

What am I doing wrong here?

asked Jan 14 '18 by Yoni Keren


People also ask

What is collate function in DataLoader?

collate_fn is called with a list of data samples each time. It is expected to collate the input samples into a batch for yielding from the data loader iterator.
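For illustration, a minimal custom collate_fn could look like the following sketch (the function name and the assumption that each sample is an (image tensor, label) pair are invented for the example, not taken from the question):

import torch
from torch.utils.data import DataLoader

def my_collate(batch):
    # 'batch' is the list of samples returned by Dataset.__getitem__ for one batch
    images = torch.stack([img for img, _ in batch], 0)     # stack images along a new batch dim
    labels = torch.tensor([label for _, label in batch])   # gather labels into one tensor
    return images, labels

# usage: loader = DataLoader(some_dataset, batch_size=4, collate_fn=my_collate)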

What is batch size in a PyTorch DataLoader?

The batch size is the number of samples the DataLoader yields per iteration, i.e. how many samples are processed before the model is updated during training. It is set with the DataLoader's batch_size argument.

What is num_workers in DataLoader?

num_workers tells the data loader instance how many sub-processes to use for data loading. If num_workers is zero (the default), data is loaded in the main process and the GPU has to wait for the CPU to load data. Generally, the greater num_workers is, the more efficiently the CPU loads data and the less the GPU has to wait.

What is the difference between a dataset and DataLoader?

Dataset stores the samples and their corresponding labels, and DataLoader wraps an iterable around the Dataset to enable easy access to the samples.
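As a toy illustration of that division of labour (a hypothetical SquaresDataset invented for this example, not code from the question):

import torch
from torch.utils.data import Dataset, DataLoader

class SquaresDataset(Dataset):
    # Dataset: knows how to fetch one sample (here a number and its square)
    def __len__(self):
        return 10
    def __getitem__(self, idx):
        return torch.tensor([float(idx)]), idx * idx

ds = SquaresDataset()
# DataLoader: wraps the Dataset and handles batching, shuffling and worker processes
loader = DataLoader(ds, batch_size=4, shuffle=False, num_workers=0)
for xs, ys in loader:
    print(xs.shape, ys)   # first batch: torch.Size([4, 1]) tensor([0, 1, 4, 9])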


1 Answer

I know how to fix the error, but I don't have a complete explanation for why it happens.

First, the solution: you need to make sure that the image data is stored as numpy.arrays. When you call json.loads, it loads them as Python lists of floats, and that causes torch.utils.data.DataLoader to individually transform each float in the list into a torch.DoubleTensor.

Have a look at default_collate in torch.utils.data.DataLoader. Your __getitem__ returns a dict, which is a mapping, so default_collate gets called again on each element of the dict. The first couple are ints, but then you get to the image data, which is a list, i.e. a collections.Sequence - this is where things get funky, as default_collate is called on each element of the list. This is clearly not what you intended. I don't know exactly what assumption torch makes about the contents of a list versus a numpy.array, but given the error it would appear that assumption is being violated.
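To see the effect concretely, here is a small demonstration run directly against default_collate (a sketch: the 4-element lists stand in for the 5625-value bands, and it was checked against a recent PyTorch rather than the asker's 0.3):

import numpy as np
from torch.utils.data.dataloader import default_collate

# two samples whose band is a plain Python list of floats, as json.loads produces
batch_as_lists = [{'is_iceberg': 0, 'band_1': [0.1, 0.2, 0.3, 0.4]} for _ in range(2)]
out = default_collate(batch_as_lists)
# the list is treated as a Sequence, so you get a *list* of tiny per-position
# tensors - one tensor for every float position - instead of one batched tensor
print(type(out['band_1']), len(out['band_1']))   # <class 'list'> 4

# two samples whose band is a numpy array
batch_as_arrays = [{'is_iceberg': 0, 'band_1': np.array([0.1, 0.2, 0.3, 0.4])} for _ in range(2)]
out = default_collate(batch_as_arrays)
# the arrays are stacked into a single (2, 4) tensor, which is what you want
print(out['band_1'].shape)                       # torch.Size([2, 4])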

The fix is pretty trivial: just make sure the two image bands are numpy.arrays, for instance in __init__:

def __init__(self,data,transform=None):
    self.data=[]
    for d in data:
        # convert the bands from plain Python lists to numpy arrays up front
        d[self.BAND1] = np.asarray(d[self.BAND1])
        d[self.BAND2] = np.asarray(d[self.BAND2])
        self.data.append(d)
    self.transform=transform

or right after you load the json - it doesn't really matter where you do it, as long as you do it.
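For example, doing the conversion right after loading the JSON would look like this (a sketch using the file and field names from the question):

import json
import numpy as np

with open('train.json', 'r') as f:
    data = json.load(f)

# convert the two bands to numpy arrays before handing the data to the Dataset
for d in data:
    d['band_1'] = np.asarray(d['band_1'])
    d['band_2'] = np.asarray(d['band_2'])

ds = IceShipDataset(data)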


Why does the above result in "too many open files"?

I don't know, but as the comments pointed out, it likely has to do with interprocess communication and lock files on the two queues that data is taken from and added to.

Footnote: the train.json was not available for download from Kaggle due to the competition still being open (??). I made a dummy json file that should have the same structure and tested the fix on that dummy file.

answered Oct 07 '22 by Matti Lyra