Using boto3, how can I retrieve all files in my S3 bucket without retrieving the folders?
Consider the following file structure:
file_1.txt
folder_1/
    file_2.txt
    file_3.txt
    folder_2/
        folder_3/
            file_4.txt
In this example I'm only interested in the 4 files.
EDIT:
A manual solution is:
def count_files_in_folder(prefix):
    total = 0
    keys = s3_client.list_objects(Bucket=bucket_name, Prefix=prefix)
    for key in keys['Contents']:
        if key['Key'][-1:] != '/':
            total += 1
    return total
In this case total would be 4.
If I just did
count = len(s3_client.list_objects(Bucket=bucket_name, Prefix=prefix)['Contents'])
the result would be 7 objects (4 files and 3 folders):
file_1.txt
folder_1/
folder_1/file_2.txt
folder_1/file_3.txt
folder_1/folder_2/
folder_1/folder_2/folder_3/
folder_1/folder_2/folder_3/file_4.txt
I JUST want:
file_1.txt
folder_1/file_2.txt
folder_1/file_3.txt
folder_1/folder_2/folder_3/file_4.txt
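(Note: list_objects returns at most 1000 keys per call, so for larger buckets the manual solution above needs pagination. A sketch of the same idea using a paginator, assuming the same s3_client and bucket_name as above:)

def count_files_in_folder(prefix):
    total = 0
    paginator = s3_client.get_paginator('list_objects_v2')
    # Each page holds up to 1000 keys; the paginator fetches them all.
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
        for obj in page.get('Contents', []):
            if not obj['Key'].endswith('/'):  # skip "folder" placeholder keys
                total += 1
    return total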
There are no folders in S3. What you have is four files named:
file_1.txt
folder_1/file_2.txt
folder_1/file_3.txt
folder_1/folder_2/folder_3/file_4.txt
Those are the actual names of the objects in S3. If what you want is to end up with:
file_1.txt
file_2.txt
file_3.txt
file_4.txt
all sitting in the same directory on a local file system, you would need to manipulate the object name to strip out just the file name. Something like this would work:
import os.path
full_name = 'folder_1/folder_2/folder_3/file_4.txt'
file_name = os.path.basename(full_name)
The variable file_name would then contain 'file_4.txt'.
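Putting the pieces together, here is a minimal sketch (the bucket name 'my-bucket' is a placeholder) that downloads every object into the current directory under just its base file name. Be aware that flattening like this can overwrite files if two "folders" contain the same file name.

import os.path
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('my-bucket')  # placeholder bucket name

for obj in bucket.objects.all():
    if obj.key.endswith('/'):
        continue  # skip zero-byte "folder" placeholder objects
    # strip the prefix, keeping only the base file name
    bucket.download_file(obj.key, os.path.basename(obj.key))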
S3 is an object store. It does not store files/objects under a directory tree. Newcomers are often confused by the "folder" option it offers, which is in fact just an arbitrary prefix for the object.
An object prefix is a way to retrieve objects organised by a predefined, fixed key-name prefix structure.
You can imagine a file system that doesn't let you create directories, but does let you create file names containing a slash "/" or backslash "\" as a delimiter, so that you can denote the "level" of a file by a common prefix.
Thus in S3, you can use any of the following key names to "simulate" a directory that is not a directory:
folder1-folder2-folder3-myobject
folder1/folder2/folder3/myobject
folder1\folder2\folder3\myobject
As you can see, an object name can be stored in S3 regardless of which arbitrary folder separator (delimiter) you use.
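If you do want S3 to group keys by such a prefix, you can pass a delimiter to the list call yourself. Here is a sketch (the bucket name 'my-bucket' is a placeholder) showing how Delimiter='/' splits a listing into files (Contents) and simulated directories (CommonPrefixes):

import boto3

s3_client = boto3.client('s3')
response = s3_client.list_objects_v2(
    Bucket='my-bucket',  # placeholder bucket name
    Prefix='folder1/',
    Delimiter='/'
)
for obj in response.get('Contents', []):
    print('file:', obj['Key'])             # e.g. folder1/myobject
for cp in response.get('CommonPrefixes', []):
    print('simulated dir:', cp['Prefix'])  # e.g. folder1/folder2/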
However, to help users make bulk file transfers to S3, tools such as the AWS CLI and the s3transfer API simplify that step and create object names that follow your local folder structure. So if you are sure that all of the S3 objects use / or \ as the separator, you can use tools like s3transfer or the AWS CLI to perform a simple download using the key name.
Here is quick-and-dirty code using the resource iterator. Bucket.objects.filter() returns an iterator that is not subject to the 1000-key limit of a single list_objects()/list_objects_v2() call.
import os
import boto3

s3 = boto3.resource('s3')
mybucket = s3.Bucket("mybucket")

# If a blank prefix is given, everything in the bucket is returned.
# Note: keys normally don't start with "/", so don't use a leading slash here.
bucket_prefix = "some/prefix/here"
objs = mybucket.objects.filter(Prefix=bucket_prefix)

for obj in objs:
    if obj.key.endswith('/'):
        continue  # skip zero-byte "folder" placeholder objects
    path, filename = os.path.split(obj.key)
    # download_file throws an exception if the local folder doesn't exist
    if path:
        os.makedirs(path, exist_ok=True)
    mybucket.download_file(obj.key, obj.key)