Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to iterate over files in an S3 bucket?

I have a large number of files (>1,000) stored in an S3 bucket, and I would like to iterate over them (e.g. in a for loop) to extract data from them using boto3.

However, I notice that in accordance with http://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Client.list_objects, the list_objects() method of the Client class only lists up to 1,000 objects:

In [1]: import boto3

In [2]: client = boto3.client('s3')

In [11]: apks = client.list_objects(Bucket='iper-apks')

In [16]: type(apks['Contents'])
Out[16]: list

In [17]: len(apks['Contents'])
Out[17]: 1000

However, I would like to list all the objects, even if there are more than 1,000. How could I achieve this?

like image 817
Kurt Peek Avatar asked May 29 '17 09:05

Kurt Peek


People also ask

Can I read S3 file without downloading?

Reading objects without downloading them Similarly, if you want to upload and read small pieces of textual data such as quotes, tweets, or news articles, you can do that using the S3 resource method put(), as demonstrated in the example below (Gist).

How do I see how many files are in a S3 bucket?

Open the AWS S3 console and click on your bucket's name. In the Objects tab, click the top row checkbox to select all files and folders or select the folders you want to count the files for. Click on the Actions button and select Calculate total size.

How fast can you read from S3?

It's no wonder it is one of the most popular data storage options around. Ideally, Amazon claims quite a bit about S3 performance benchmarks. 55,000 read requests per second, 100–200 milliseconds small object latencies, and more.

Is there a limit to the number of files in S3 bucket?

S3 provides unlimited scalability, and there is no official limit on the amount of data and number of objects you can store in an S3 bucket. The size limit for objects stored in a bucket is 5 TB.


2 Answers

As kurt-peek notes, boto3 has a Paginator class, which allows you to iterator over pages of s3 objects, and can easily be used to provide an iterator over items within the pages:

import boto3


def iterate_bucket_items(bucket):
    """
    Generator that iterates over all objects in a given s3 bucket

    See http://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Client.list_objects_v2 
    for return data format
    :param bucket: name of s3 bucket
    :return: dict of metadata for an object
    """


    client = boto3.client('s3')
    paginator = client.get_paginator('list_objects_v2')
    page_iterator = paginator.paginate(Bucket=bucket)

    for page in page_iterator:
        if page['KeyCount'] > 0:
            for item in page['Contents']:
                yield item


for i in iterate_bucket_items(bucket='my_bucket'):
    print i

Which will output something like:

{u'ETag': '"a8a9ee11bd4766273ab4b54a0e97c589"',
 u'Key': '2017-06-01-10-17-57-EBDC490AD194E7BF',
 u'LastModified': datetime.datetime(2017, 6, 1, 10, 17, 58, tzinfo=tzutc()),
 u'Size': 242,
 u'StorageClass': 'STANDARD'}
{u'ETag': '"03be0b66e34cbc4c037729691cd5efab"',
 u'Key': '2017-06-01-10-28-58-732EB022229AACF7',
 u'LastModified': datetime.datetime(2017, 6, 1, 10, 28, 59, tzinfo=tzutc()),
 u'Size': 238,
 u'StorageClass': 'STANDARD'}
...

Note that list_objects_v2 is recommended instead of list_objects: https://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketGET.html

You can also do this at a lower level by calling list_objects_v2() directly and passing in the NextContinuationToken value from the response as ContinuationToken while isTruncated is true in the response.

like image 75
John Carter Avatar answered Oct 12 '22 05:10

John Carter


I found out that boto3 has a Paginator class to deal with truncated results. The following worked for me:

paginator = client.get_paginator('list_objects')
page_iterator = paginator.paginate(Bucket='iper-apks')

after which I can use the page_iterator generator in a for loop.

like image 28
Kurt Peek Avatar answered Oct 12 '22 05:10

Kurt Peek