I have a large number of files (>1,000) stored in an S3 bucket, and I would like to iterate over them (e.g. in a for loop) to extract data from them using boto3.

However, I notice that, per http://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Client.list_objects, the list_objects() method of the Client class only lists up to 1,000 objects:
In [1]: import boto3
In [2]: client = boto3.client('s3')
In [11]: apks = client.list_objects(Bucket='iper-apks')
In [16]: type(apks['Contents'])
Out[16]: list
In [17]: len(apks['Contents'])
Out[17]: 1000
However, I would like to list all the objects, even if there are more than 1,000. How could I achieve this?
As kurt-peek notes, boto3 has a Paginator class, which allows you to iterate over pages of S3 objects and can easily be used to provide an iterator over the items within those pages:
import boto3


def iterate_bucket_items(bucket):
    """
    Generator that iterates over all objects in a given s3 bucket

    See http://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Client.list_objects_v2
    for return data format

    :param bucket: name of s3 bucket
    :return: dict of metadata for an object
    """
    client = boto3.client('s3')
    paginator = client.get_paginator('list_objects_v2')
    page_iterator = paginator.paginate(Bucket=bucket)

    for page in page_iterator:
        if page['KeyCount'] > 0:
            for item in page['Contents']:
                yield item


for i in iterate_bucket_items(bucket='my_bucket'):
    print(i)
This will output something like:

{'ETag': '"a8a9ee11bd4766273ab4b54a0e97c589"',
 'Key': '2017-06-01-10-17-57-EBDC490AD194E7BF',
 'LastModified': datetime.datetime(2017, 6, 1, 10, 17, 58, tzinfo=tzutc()),
 'Size': 242,
 'StorageClass': 'STANDARD'}
{'ETag': '"03be0b66e34cbc4c037729691cd5efab"',
 'Key': '2017-06-01-10-28-58-732EB022229AACF7',
 'LastModified': datetime.datetime(2017, 6, 1, 10, 28, 59, tzinfo=tzutc()),
 'Size': 238,
 'StorageClass': 'STANDARD'}
...
Note that list_objects_v2 is recommended instead of list_objects: https://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketGET.html
You can also do this at a lower level by calling list_objects_v2() directly and passing the NextContinuationToken value from each response back in as ContinuationToken for as long as IsTruncated is true in the response.
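A minimal sketch of that manual loop (using the same placeholder bucket name as above):

import boto3

client = boto3.client('s3')

# Keep calling list_objects_v2 until IsTruncated is false, feeding each
# response's NextContinuationToken back in as ContinuationToken.
kwargs = {'Bucket': 'my_bucket'}  # placeholder bucket name
while True:
    response = client.list_objects_v2(**kwargs)
    for item in response.get('Contents', []):
        print(item['Key'])
    if not response.get('IsTruncated'):
        break
    kwargs['ContinuationToken'] = response['NextContinuationToken']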
I found out that boto3 has a Paginator class to deal with truncated results. The following worked for me:
import boto3

client = boto3.client('s3')
paginator = client.get_paginator('list_objects')
page_iterator = paginator.paginate(Bucket='iper-apks')
after which I can use the page_iterator generator in a for loop.
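For example, a minimal sketch of that loop (pages with no results omit the 'Contents' key, hence the .get() guard):

for page in page_iterator:
    for obj in page.get('Contents', []):
        print(obj['Key'], obj['Size'])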