
PyTorch's dataloader "too many open files" error when no files should be open

So this is a minimal example which illustrates the issue:

This is the Dataset:

import numpy as np
from torch.utils.data import Dataset

class IceShipDataset(Dataset):
    BAND1='band_1'
    BAND2='band_2'
    IMAGE='image'

    @staticmethod
    def get_band_img(sample,band):
        # reshape the flat list of band values into a square 75x75 image
        pic_size=75
        img=np.array(sample[band])
        img.resize(pic_size,pic_size)
        return img

    def __init__(self,data,transform=None):
        self.data=data
        self.transform=transform

    def __len__(self):
        return len(self.data)  

    def __getitem__(self, idx):

        sample=self.data[idx]
        band1_img=IceShipDataset.get_band_img(sample,self.BAND1)
        band2_img=IceShipDataset.get_band_img(sample,self.BAND2)
        img=np.stack([band1_img,band2_img],2)
        sample[self.IMAGE]=img
        if self.transform is not None:
                sample=self.transform(sample)
        return sample

And this is the code which fails:

import json
import torch

PLAY_BATCH_SIZE=4
#load data. There are 1604 examples.
with open('train.json','r') as f:
        data=f.read()
data=json.loads(data)

ds=IceShipDataset(data)
playloader = torch.utils.data.DataLoader(ds, batch_size=PLAY_BATCH_SIZE,
                                          shuffle=False, num_workers=4)
for i,data in enumerate(playloader):
        print(i)

It fails with that weird "too many open files" error inside the for loop. My torch version is 0.3.0.post4.

If you want the json file, it is available at Kaggle (https://www.kaggle.com/c/statoil-iceberg-classifier-challenge)

I should mention that the error has nothing to do with my laptop's open-file limits, which are nowhere near exhausted:

yoni@yoni-Lenovo-Z710:~$ lsof | wc -l
89114
yoni@yoni-Lenovo-Z710:~$ cat /proc/sys/fs/file-max
791958

What am I doing wrong here?

asked Jan 14 '18 by Yoni Keren


People also ask

What is collate function in DataLoader?

collate_fn is called with a list of data samples each time. It is expected to collate the input samples into a batch for yielding from the data loader iterator.
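For illustration, a minimal custom collate_fn could look like the following sketch (the function name and the assumption that each sample is an (image tensor, label) pair are invented for the example, not taken from the question):

import torch
from torch.utils.data import DataLoader

def my_collate(batch):
    # 'batch' is the list of samples returned by Dataset.__getitem__ for one batch
    images = torch.stack([img for img, _ in batch], 0)     # stack images along a new batch dim
    labels = torch.tensor([label for _, label in batch])   # gather labels into one tensor
    return images, labels

# usage: loader = DataLoader(some_dataset, batch_size=4, collate_fn=my_collate)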

What is batch size in a PyTorch DataLoader?

The batch size is the number of samples the DataLoader yields per iteration, i.e. how many samples are processed before the model is updated during training. It is set with the DataLoader's batch_size argument.

What is num_workers in DataLoader?

num_workers tells the data loader instance how many sub-processes to use for data loading. If num_workers is zero (the default), data is loaded in the main process and the GPU has to wait for the CPU to load data. Generally, the greater num_workers is, the more efficiently the CPU loads data and the less the GPU has to wait.

What is the difference between a dataset and DataLoader?

Dataset stores the samples and their corresponding labels, and DataLoader wraps an iterable around the Dataset to enable easy access to the samples.
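As a toy illustration of that division of labour (a hypothetical SquaresDataset invented for this example, not code from the question):

import torch
from torch.utils.data import Dataset, DataLoader

class SquaresDataset(Dataset):
    # Dataset: knows how to fetch one sample (here a number and its square)
    def __len__(self):
        return 10
    def __getitem__(self, idx):
        return torch.tensor([float(idx)]), idx * idx

ds = SquaresDataset()
# DataLoader: wraps the Dataset and handles batching, shuffling and worker processes
loader = DataLoader(ds, batch_size=4, shuffle=False, num_workers=0)
for xs, ys in loader:
    print(xs.shape, ys)   # first batch: torch.Size([4, 1]) tensor([0, 1, 4, 9])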


1 Answer

I know how to fix the error, but I don't have a complete explanation for why it happens.

First, the solution: you need to make sure that the image data is stored as numpy.arrays. When you call json.loads, it loads them as Python lists of floats, and that causes torch.utils.data.DataLoader to individually transform each float in the list into a torch.DoubleTensor.

Have a look at default_collate in torch.utils.data.DataLoader. Your __getitem__ returns a dict, which is a mapping, so default_collate gets called again on each element of the dict. The first couple are ints, but then you get to the image data, which is a list, i.e. a collections.Sequence - this is where things get funky, as default_collate is called on each element of the list. This is clearly not what you intended. I don't know exactly what assumption torch makes about the contents of a list versus a numpy.array, but given the error it would appear that assumption is being violated.
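To see the effect concretely, here is a small demonstration run directly against default_collate (a sketch: the 4-element lists stand in for the 5625-value bands, and it was checked against a recent PyTorch rather than the asker's 0.3):

import numpy as np
from torch.utils.data.dataloader import default_collate

# two samples whose band is a plain Python list of floats, as json.loads produces
batch_as_lists = [{'is_iceberg': 0, 'band_1': [0.1, 0.2, 0.3, 0.4]} for _ in range(2)]
out = default_collate(batch_as_lists)
# the list is treated as a Sequence, so you get a *list* of tiny per-position
# tensors - one tensor for every float position - instead of one batched tensor
print(type(out['band_1']), len(out['band_1']))   # <class 'list'> 4

# two samples whose band is a numpy array
batch_as_arrays = [{'is_iceberg': 0, 'band_1': np.array([0.1, 0.2, 0.3, 0.4])} for _ in range(2)]
out = default_collate(batch_as_arrays)
# the arrays are stacked into a single (2, 4) tensor, which is what you want
print(out['band_1'].shape)                       # torch.Size([2, 4])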

The fix is pretty trivial: just make sure the two image bands are numpy.arrays, for instance in __init__:

def __init__(self,data,transform=None):
    self.data=[]
    for d in data:
        # convert the bands from plain Python lists to numpy arrays up front
        d[self.BAND1] = np.asarray(d[self.BAND1])
        d[self.BAND2] = np.asarray(d[self.BAND2])
        self.data.append(d)
    self.transform=transform

or right after you load the json - it doesn't really matter where you do it, as long as you do it.
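For example, doing the conversion right after loading the JSON would look like this (a sketch using the file and field names from the question):

import json
import numpy as np

with open('train.json', 'r') as f:
    data = json.load(f)

# convert the two bands to numpy arrays before handing the data to the Dataset
for d in data:
    d['band_1'] = np.asarray(d['band_1'])
    d['band_2'] = np.asarray(d['band_2'])

ds = IceShipDataset(data)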


Why does the above result in "too many open files"?

I don't know, but as the comments pointed out, it likely has to do with interprocess communication and lock files on the two queues that data is taken from and added to.

Footnote: the train.json was not available for download from Kaggle due to the competition still being open (??). I made a dummy json file that should have the same structure and tested the fix on that dummy file.

answered Oct 07 '22 by Matti Lyra