I have implemented a PyTorch Dataset that works locally (on my own desktop), but it breaks when executed on AWS SageMaker. My Dataset implementation is as follows.
from os import listdir
from os.path import isfile, join

from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms


class ImageDataset(Dataset):
    def __init__(self, path='./images', transform=None):
        self.path = path
        self.files = [join(path, f) for f in listdir(path) if isfile(join(path, f)) and f.endswith('.jpg')]
        self.transform = transform
        if transform is None:
            self.transform = transforms.Compose([
                transforms.Resize((128, 128)),
                transforms.ToTensor(),
                transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
            ])

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        img_name = self.files[idx]
        # the label is inferred from the filename, e.g. 'img-3.jpg' -> 3
        dash_idx = img_name.rfind('-')
        dot_idx = img_name.rfind('.')
        label = int(img_name[dash_idx + 1:dot_idx])
        image = Image.open(img_name)
        if self.transform:
            image = self.transform(image)
        return image, label
I am following this example and this one too, and I run the estimator
as follows.
inputs = {
    'train': 'file://images',
    'eval': 'file://images'
}

estimator = PyTorch(entry_point='pytorch-train.py',
                    role=role,
                    framework_version='1.0.0',
                    train_instance_count=1,
                    train_instance_type=instance_type)
estimator.fit(inputs)
I get the following error.
FileNotFoundError: [Errno 2] No such file or directory: './images'
In the example that I am following, they upload the CIFAR dataset (which is downloaded locally) to S3.
inputs = sagemaker_session.upload_data(path='data', bucket=bucket, key_prefix='data/cifar10')
If I take a peek at inputs, it is just the string literal s3://sagemaker-us-east-3-184838577132/data/cifar10. The code to create a Dataset and a DataLoader is shown here, which does not help unless I track down the source and step through the logic.
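So presumably my data would have to go to S3 the same way, something like the sketch below (the bucket and key prefix are just placeholders mirroring the example).

# Illustrative sketch only: upload the local folder to S3 and pass the returned
# s3:// URI as the training/eval channels (bucket and key prefix are placeholders).
inputs = sagemaker_session.upload_data(path='images', bucket=bucket, key_prefix='data/images')
estimator.fit({'train': inputs, 'eval': inputs})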
I think what needs to happen inside my ImageDataset is to supply the S3 path and use the AWS CLI or something similar to list the files and fetch their content. I do not think the AWS CLI is the right approach, though, since it relies on shelling out: I would have to run sub-process commands and then parse their output. There must be a recipe or something to create a custom Dataset backed by S3 files, right?
You can use Amazon SageMaker to train and deploy a model using custom PyTorch code. The SageMaker Python SDK PyTorch estimators and models and the SageMaker open-source PyTorch container make writing a PyTorch script and running it in SageMaker easier.
I was able to create a PyTorch Dataset backed by S3 data using boto3. Here's the snippet if anyone is interested.
import tempfile

import boto3
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms


class ImageDataset(Dataset):
    def __init__(self, path='./images', transform=None):
        self.path = path
        self.s3 = boto3.resource('s3')
        self.bucket = self.s3.Bucket(path)
        self.files = [obj.key for obj in self.bucket.objects.all()]
        self.transform = transform
        if transform is None:
            self.transform = transforms.Compose([
                transforms.Resize((128, 128)),
                transforms.ToTensor(),
                transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
            ])

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        img_name = self.files[idx]
        # the label is inferred from the object key, e.g. 'img-3.jpg' -> 3
        dash_idx = img_name.rfind('-')
        dot_idx = img_name.rfind('.')
        label = int(img_name[dash_idx + 1:dot_idx])
        # download the object from S3 to a local temporary file
        obj = self.bucket.Object(img_name)
        tmp = tempfile.NamedTemporaryFile()
        tmp_name = '{}.jpg'.format(tmp.name)
        with open(tmp_name, 'wb') as f:
            obj.download_fileobj(f)
        image = Image.open(tmp_name)
        if self.transform:
            image = self.transform(image)
        return image, label
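Usage then looks roughly like the sketch below. Note that the path argument is handed straight to boto3's Bucket(), so it should be a bucket name, not an s3:// URI (the bucket name here is a placeholder).

from torch.utils.data import DataLoader

# 'my-image-bucket' is a placeholder bucket name.
dataset = ImageDataset(path='my-image-bucket')
loader = DataLoader(dataset, batch_size=32, shuffle=True)
images, labels = next(iter(loader))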