Iterate over files in an S3 bucket with folder structure

I have an S3 bucket. Inside the bucket we have a folder for the year, 2018, and some files we have collected for each month and day. So, as an example, 2018/3/24, 2018/3/25, and so on.

We didn't put the dates in the files themselves inside each day's folder.

Basically, I want to iterate through the bucket and use the folder structure to classify each file by its 'date', since we need to load the files into a different database and will need a way to identify them.

I've read a ton of posts on using boto3 and iterating through buckets, but there seem to be conflicting details on whether what I need can be done.

If there's an easier way of doing this please suggest.

This got me close:

import boto3

s3client = boto3.client('s3')
bucket = 'bucketname'
startAfter = '2018'

s3objects = s3client.list_objects_v2(Bucket=bucket, StartAfter=startAfter)
for obj in s3objects['Contents']:  # 'obj' avoids shadowing the built-in 'object'
    print(obj['Key'])
asked Mar 26 '18 by DataDog

People also ask

Can S3 buckets have folders?

You can have folders within folders, but not buckets within buckets. You can upload and copy objects directly into a folder.
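Under the hood, a "folder" is just a key prefix: S3 is a flat key/value store, and the console renders "/" delimiters in key names as folders. A minimal sketch of creating a "folder" implicitly (the bucket name and key are placeholders):

import boto3

s3 = boto3.client('s3')

# No mkdir needed: writing an object whose key contains '/' delimiters
# makes 2018/3/24/ show up as nested folders in the S3 console.
s3.put_object(Bucket='bucketname', Key='2018/3/24/readings.csv',
              Body=b'col1,col2\n1,2\n')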

Is it better to have multiple S3 buckets or one bucket with sub folders?

Simpler permissions with multiple buckets: if the images are used in different use cases, using multiple buckets will simplify the permissions model, since you can give clients/users bucket-level permissions instead of directory-level permissions.

What is a "_$folder$" file in S3?

The "_$folder$" files are placeholders. Apache Hadoop creates these files when you use the -mkdir command to create a folder in an S3 bucket. Hadoop doesn't create the folder until you PUT the first object. If you delete the "_$folder$" files before you PUT at least one object, Hadoop can't create the folder.


1 Answer

When using boto3, you can only list 1,000 objects per request, so to obtain all the objects in the bucket you can use S3's paginator.

client.get_paginator('list_objects_v2') is what you need.

Something like this:

import boto3

client = boto3.client('s3')
paginator = client.get_paginator('list_objects_v2')
result = paginator.paginate(Bucket='bucketname', StartAfter='2018')
for page in result:
    if "Contents" in page:
        for key in page["Contents"]:
            keyString = key["Key"]
            print(keyString)
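Since the original goal is to classify each file by the date encoded in its key, you can parse the prefix inside that loop. A minimal sketch, assuming keys shaped like 2018/3/24/filename (key_to_date is a hypothetical helper, not part of boto3):

from datetime import date

def key_to_date(key):
    # Expect keys like '2018/3/24/readings.csv'
    parts = key.split('/')
    if len(parts) < 4:
        return None  # key doesn't follow the year/month/day/file layout
    try:
        return date(int(parts[0]), int(parts[1]), int(parts[2]))
    except ValueError:
        return None  # non-numeric path components

Calling key_to_date(keyString) inside the loop above then gives you the date to store alongside each file when loading it into the other database.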

From this documentation:

list_objects:

Returns some or all (up to 1000) of the objects in a bucket. You can use the request parameters as selection criteria to return a subset of the objects in a bucket.

list_objects_v2:

Returns some or all (up to 1000) of the objects in a bucket. You can use the request parameters as selection criteria to return a subset of the objects in a bucket. Note: ListObjectsV2 is the revised List Objects API and we recommend you use this revised API for new application development.

From this answer:

list_objects_v2 has added features. Due to the 1,000-keys-per-page listing limit, using a marker to list multiple pages can be a headache. Logically, you need to keep track of the last key you successfully processed. With ContinuationToken, you don't need to know the last key; you just check for the existence of NextContinuationToken in the response. You can spawn parallel processes to deal with multiples of 1,000 keys without tracking the last key to fetch the next page.
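For comparison, here is what manual paging with ContinuationToken looks like; the paginator above does exactly this bookkeeping for you (the bucket name is a placeholder):

import boto3

client = boto3.client('s3')
kwargs = {'Bucket': 'bucketname', 'StartAfter': '2018'}

while True:
    response = client.list_objects_v2(**kwargs)
    for obj in response.get('Contents', []):
        print(obj['Key'])
    # IsTruncated is True while more pages remain; NextContinuationToken
    # is only present in that case.
    if response.get('IsTruncated'):
        kwargs['ContinuationToken'] = response['NextContinuationToken']
    else:
        break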

answered by Venkatesh Wadawadagi