So here is some minimal code that illustrates the issue.
This is the Dataset:
import json

import numpy as np
import torch
from torch.utils.data import Dataset


class IceShipDataset(Dataset):
    BAND1 = 'band_1'
    BAND2 = 'band_2'
    IMAGE = 'image'

    @staticmethod
    def get_band_img(sample, band):
        # Each band comes in as a flat list of 75*75 floats.
        pic_size = 75
        img = np.array(sample[band])
        img.resize(pic_size, pic_size)
        return img

    def __init__(self, data, transform=None):
        self.data = data
        self.transform = transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        sample = self.data[idx]
        band1_img = IceShipDataset.get_band_img(sample, self.BAND1)
        band2_img = IceShipDataset.get_band_img(sample, self.BAND2)
        # Stack the two bands into a single 75x75x2 image.
        img = np.stack([band1_img, band2_img], 2)
        sample[self.IMAGE] = img
        if self.transform is not None:
            sample = self.transform(sample)
        return sample
And this is the code which fails:
PLAY_BATCH_SIZE = 4

# Load data. There are 1604 examples.
with open('train.json', 'r') as f:
    data = f.read()
data = json.loads(data)

ds = IceShipDataset(data)
playloader = torch.utils.data.DataLoader(ds, batch_size=PLAY_BATCH_SIZE,
                                         shuffle=False, num_workers=4)

for i, data in enumerate(playloader):
    print(i)
It fails with that weird "too many open files" error inside the for loop. My torch version is 0.3.0.post4.
If you want the json file, it is available at Kaggle (https://www.kaggle.com/c/statoil-iceberg-classifier-challenge)
I should mention that the error has nothing to do with my laptop actually running out of file handles:
yoni@yoni-Lenovo-Z710:~$ lsof | wc -l
89114
yoni@yoni-Lenovo-Z710:~$ cat /proc/sys/fs/file-max
791958
What am I doing wrong here?
I know how to fix the error, but I don't have a complete explanation for why it happens.
First, the solution: you need to make sure that the image data is stored as numpy.arrays. When you call json.loads, it loads the bands as Python lists of floats, and this causes torch.utils.data.DataLoader to individually transform each float in the list into a torch.DoubleTensor.
Have a look at default_collate in torch.utils.data.DataLoader: your __getitem__ returns a dict, which is a mapping, so default_collate gets called again on each element of the dict. The first couple of values are ints, but then you get to the image data, which is a list, i.e. a collections.Sequence - this is where things get funky, as default_collate is called on each element of the list. This is clearly not what you intended. I don't know what assumption torch makes about the contents of a list versus a numpy.array, but given the error it would appear that that assumption is being violated.
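To make the recursion concrete, here is a small sketch (not from the original post) comparing what default_collate produces for a band stored as a list of floats versus as a numpy array. The import path is the torch 0.3-era one; exact tensor types may differ across versions.

# Minimal demonstration of how default_collate treats the two cases.
import numpy as np
from torch.utils.data.dataloader import default_collate

list_sample = {'id': 0, 'band_1': [0.5] * (75 * 75)}     # band as a plain Python list
array_sample = {'id': 0, 'band_1': np.zeros((75, 75))}   # same data as a numpy array

list_batch = default_collate([list_sample, list_sample])
array_batch = default_collate([array_sample, array_sample])

# The list version is collated element by element: a Python list of
# 5625 tiny tensors, one per pixel, each holding the values for the batch.
print(type(list_batch['band_1']), len(list_batch['band_1']))
# The numpy version is stacked into a single (2, 75, 75) tensor.
print(type(array_batch['band_1']), array_batch['band_1'].shape)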
The fix is pretty trivial: just make sure the two image bands are numpy.arrays, for instance in __init__:

def __init__(self, data, transform=None):
    self.data = []
    for d in data:
        d[self.BAND1] = np.asarray(d[self.BAND1])
        d[self.BAND2] = np.asarray(d[self.BAND2])
        self.data.append(d)
    self.transform = transform
or right after you load the json, whatever - it doesn't really matter where you do it, as long as you do it.
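For completeness, here is a sketch of the same fix applied right after loading the json instead of inside __init__ (the 'band_1'/'band_2' strings are the same keys the dataset's BAND1/BAND2 constants refer to):

import json
import numpy as np

with open('train.json', 'r') as f:
    data = json.loads(f.read())

# Convert the band lists to numpy arrays before handing them to the Dataset,
# so default_collate sees arrays instead of Python lists of floats.
for d in data:
    d['band_1'] = np.asarray(d['band_1'])
    d['band_2'] = np.asarray(d['band_2'])

ds = IceShipDataset(data)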
Why does the above result in "too many open files"?
I don't know, but as the comments pointed out, it likely has to do with interprocess communication and lock files on the two queues data is taken from and added to.
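One plausible contributing factor (my guess, not part of the original answer): on Linux the default tensor sharing strategy passes each tensor a worker sends back through its own file descriptor, and the broken collation above produces thousands of tiny tensors per batch. If descriptor exhaustion is indeed what bites here, switching the sharing strategy is a known workaround, separate from the numpy fix:

# Hedged workaround, not the root-cause fix: share worker tensors via
# files in shared memory instead of one file descriptor per tensor.
import torch.multiprocessing
torch.multiprocessing.set_sharing_strategy('file_system')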
Footnote: the train.json was not available for download from Kaggle due to the competition still being open (??). I made a dummy json file that should have the same structure and tested the fix on that dummy file.