I have a large number of files (>1,000) stored in an S3 bucket, and I would like to iterate over them (e.g. in a for loop) to extract data from them using boto3.

However, I notice that, per http://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Client.list_objects, the list_objects() method of the Client class only lists up to 1,000 objects:
In [1]: import boto3
In [2]: client = boto3.client('s3')
In [11]: apks = client.list_objects(Bucket='iper-apks')
In [16]: type(apks['Contents'])
Out[16]: list
In [17]: len(apks['Contents'])
Out[17]: 1000
However, I would like to list all the objects, even if there are more than 1,000. How could I achieve this?
As kurt-peek notes, boto3 has a Paginator class, which allows you to iterate over pages of S3 objects and can easily be used to provide an iterator over the items within those pages:
import boto3


def iterate_bucket_items(bucket):
    """
    Generator that iterates over all objects in a given s3 bucket

    See http://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Client.list_objects_v2
    for return data format

    :param bucket: name of s3 bucket
    :return: dict of metadata for an object
    """
    client = boto3.client('s3')
    paginator = client.get_paginator('list_objects_v2')
    page_iterator = paginator.paginate(Bucket=bucket)

    for page in page_iterator:
        if page['KeyCount'] > 0:
            for item in page['Contents']:
                yield item


for i in iterate_bucket_items(bucket='my_bucket'):
    print(i)
This will output something like:

{'ETag': '"a8a9ee11bd4766273ab4b54a0e97c589"',
 'Key': '2017-06-01-10-17-57-EBDC490AD194E7BF',
 'LastModified': datetime.datetime(2017, 6, 1, 10, 17, 58, tzinfo=tzutc()),
 'Size': 242,
 'StorageClass': 'STANDARD'}
{'ETag': '"03be0b66e34cbc4c037729691cd5efab"',
 'Key': '2017-06-01-10-28-58-732EB022229AACF7',
 'LastModified': datetime.datetime(2017, 6, 1, 10, 28, 59, tzinfo=tzutc()),
 'Size': 238,
 'StorageClass': 'STANDARD'}
...
Note that list_objects_v2 is recommended instead of list_objects: https://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketGET.html
You can also do this at a lower level by calling list_objects_v2() directly and passing the NextContinuationToken value from each response back in as ContinuationToken for as long as IsTruncated is true in the response.
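A minimal sketch of that manual loop (using the same placeholder bucket name as above):

import boto3

client = boto3.client('s3')

# Keep calling list_objects_v2 until IsTruncated is false, feeding each
# response's NextContinuationToken back in as ContinuationToken.
kwargs = {'Bucket': 'my_bucket'}  # placeholder bucket name
while True:
    response = client.list_objects_v2(**kwargs)
    for item in response.get('Contents', []):
        print(item['Key'])
    if not response.get('IsTruncated'):
        break
    kwargs['ContinuationToken'] = response['NextContinuationToken']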
I found out that boto3 has a Paginator class to deal with truncated results. The following worked for me:
import boto3

client = boto3.client('s3')
paginator = client.get_paginator('list_objects')
page_iterator = paginator.paginate(Bucket='iper-apks')
after which I can use the page_iterator generator in a for loop.
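For example, a minimal sketch of that loop (pages with no results omit the 'Contents' key, hence the .get() guard):

for page in page_iterator:
    for obj in page.get('Contents', []):
        print(obj['Key'], obj['Size'])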