Iterate over files in an S3 bucket with folder structure

I have an S3 bucket. Inside the bucket we have a folder for the year, 2018, and some files we have collected for each month and day. So, as an example, 2018/3/24, 2018/3/25, and so on.

We didn't put the dates in the files themselves inside each day's folder.

Basically, I want to iterate through the bucket and use the folder structure to classify each file by its 'date', since we need to load the files into a different database and will need a way to identify them.

I've read a ton of posts on using boto3 and iterating through buckets, but there seem to be conflicting details on whether what I need can be done.

If there's an easier way of doing this please suggest.

This got me close:

import boto3

s3client = boto3.client('s3')
bucket = 'bucketname'
startAfter = '2018'

s3objects = s3client.list_objects_v2(Bucket=bucket, StartAfter=startAfter)
for obj in s3objects['Contents']:  # 'obj' avoids shadowing the built-in 'object'
    print(obj['Key'])
asked Mar 26 '18 by DataDog

People also ask

Can S3 buckets have folders?

You can have folders within folders, but not buckets within buckets. You can upload and copy objects directly into a folder.
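Under the hood, a "folder" is just a key prefix: S3 is a flat key/value store, and the console renders "/" delimiters in key names as folders. A minimal sketch of creating a "folder" implicitly (the bucket name and key are placeholders):

import boto3

s3 = boto3.client('s3')

# No mkdir needed: writing an object whose key contains '/' delimiters
# makes 2018/3/24/ show up as nested folders in the S3 console.
s3.put_object(Bucket='bucketname', Key='2018/3/24/readings.csv',
              Body=b'col1,col2\n1,2\n')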

Is it better to have multiple S3 buckets or one bucket with sub folders?

Simpler permissions with multiple buckets: if the images are used in different use cases, using multiple buckets will simplify the permissions model, since you can give clients/users bucket-level permissions instead of directory-level permissions.

What is a "_$folder$" file in S3?

The "_$folder$" files are placeholders. Apache Hadoop creates these files when you use the -mkdir command to create a folder in an S3 bucket. Hadoop doesn't create the folder until you PUT the first object. If you delete the "_$folder$" files before you PUT at least one object, Hadoop can't create the folder.


1 Answer

When using boto3, you can only list 1,000 objects per request, so to obtain all the objects in the bucket you can use S3's paginator.

client.get_paginator('list_objects_v2') is what you need.

Something like this:

import boto3

client = boto3.client('s3')
paginator = client.get_paginator('list_objects_v2')
result = paginator.paginate(Bucket='bucketname', StartAfter='2018')
for page in result:
    if "Contents" in page:
        for key in page["Contents"]:
            keyString = key["Key"]
            print(keyString)
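Since the original goal is to classify each file by the date encoded in its key, you can parse the prefix inside that loop. A minimal sketch, assuming keys shaped like 2018/3/24/filename (key_to_date is a hypothetical helper, not part of boto3):

from datetime import date

def key_to_date(key):
    # Expect keys like '2018/3/24/readings.csv'
    parts = key.split('/')
    if len(parts) < 4:
        return None  # key doesn't follow the year/month/day/file layout
    try:
        return date(int(parts[0]), int(parts[1]), int(parts[2]))
    except ValueError:
        return None  # non-numeric path components

Calling key_to_date(keyString) inside the loop above then gives you the date to store alongside each file when loading it into the other database.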

From this documentation:

list_objects:

Returns some or all (up to 1000) of the objects in a bucket. You can use the request parameters as selection criteria to return a subset of the objects in a bucket.

list_objects_v2:

Returns some or all (up to 1000) of the objects in a bucket. You can use the request parameters as selection criteria to return a subset of the objects in a bucket. Note: ListObjectsV2 is the revised List Objects API and we recommend you use this revised API for new application development.

From this answer:

list_objects_v2 has added features. Due to the 1,000-keys-per-page listing limit, using a marker to list multiple pages can be a headache. Logically, you need to keep track of the last key you successfully processed. With ContinuationToken, you don't need to know the last key; you just check for the existence of NextContinuationToken in the response. You can spawn parallel processes to deal with multiples of 1,000 keys without tracking the last key to fetch the next page.
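For comparison, here is what manual paging with ContinuationToken looks like; the paginator above does exactly this bookkeeping for you (the bucket name is a placeholder):

import boto3

client = boto3.client('s3')
kwargs = {'Bucket': 'bucketname', 'StartAfter': '2018'}

while True:
    response = client.list_objects_v2(**kwargs)
    for obj in response.get('Contents', []):
        print(obj['Key'])
    # IsTruncated is True while more pages remain; NextContinuationToken
    # is only present in that case.
    if response.get('IsTruncated'):
        kwargs['ContinuationToken'] = response['NextContinuationToken']
    else:
        break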

answered by Venkatesh Wadawadagi