 

Reading multiple csv files from S3 bucket with boto3

I need to read multiple csv files from an S3 bucket with boto3 in Python and finally combine them into a single pandas DataFrame.

I am able to read a single file with the following script in Python:

import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('test-bucket')
for obj in bucket.objects.all():
    key = obj.key
    body = obj.get()['Body'].read()

Following is my path:

 files/splittedfiles/Code-345678

Under Code-345678 I have multiple csv files which I have to read and combine into a single pandas DataFrame.

Also, how do I pass a list of selected Codes, so that it reads those folders only? e.g.

files/splittedfiles/Code-345678
files/splittedfiles/Code-345679
files/splittedfiles/Code-345680
files/splittedfiles/Code-345681
files/splittedfiles/Code-345682

From the above I need to read files under the following codes only:

345678,345679,345682

How can I do it in python?

Neil asked Oct 17 '18


3 Answers

The boto3 API does not support reading multiple objects at once. What you can do instead is retrieve all objects that share a given prefix and load each of them in a loop. To do this, use the filter() method and set the Prefix parameter to the prefix of the objects you want to load. Below is a small change to your code that fetches every object under the prefix "files/splittedfiles/Code-345678"; looping over those objects lets you load each file into a DataFrame:

import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('test-bucket')
prefix_objs = bucket.objects.filter(Prefix="files/splittedfiles/Code-345678")
for obj in prefix_objs:
    key = obj.key
    body = obj.get()['Body'].read()

If you have multiple prefixes to evaluate, you can turn the above into a function that takes the prefix as a parameter, then combine the results. The function could look something like this:

import io

import boto3
import pandas as pd

def read_prefix_to_df(prefix):
    s3 = boto3.resource('s3')
    bucket = s3.Bucket('test-bucket')
    prefix_objs = bucket.objects.filter(Prefix=prefix)
    prefix_df = []
    for obj in prefix_objs:
        body = obj.get()['Body'].read()
        # Parse the raw bytes as csv; passing them straight to
        # pd.DataFrame(body) raises "DataFrame constructor not
        # properly called!"
        df = pd.read_csv(io.BytesIO(body), encoding='utf8')
        prefix_df.append(df)
    return pd.concat(prefix_df)

Then you can iteratively apply this function to each prefix and combine the results in the end.
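For example, the selected codes from the question could be combined like this. Here read_prefix_to_df is replaced by a stub so the combining step runs without S3 access; the "Code-" prefix layout is taken from the question's paths:

```python
import pandas as pd

# Stub standing in for read_prefix_to_df above, so the combining
# step can be shown without an S3 connection; it returns one row
# tagged with the Code extracted from the prefix.
def read_prefix_to_df(prefix):
    return pd.DataFrame({"code": [prefix.rsplit("-", 1)[-1]], "value": [1]})

codes = [345678, 345679, 345682]  # the codes selected in the question
prefixes = [f"files/splittedfiles/Code-{code}" for code in codes]
combined = pd.concat([read_prefix_to_df(p) for p in prefixes], ignore_index=True)
```

With the real function in place of the stub, combined would hold all rows from every csv file under the selected prefixes.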

vielkind answered Nov 20 '22


Modifying Answer 1 to overcome the error "DataFrame constructor not properly called!":

Code:

import boto3
import pandas as pd
import io

s3 = boto3.resource('s3')
bucket = s3.Bucket('bucket_name')
prefix_objs = bucket.objects.filter(Prefix="folder_path/prefix")

prefix_df = []

for obj in prefix_objs:
    key = obj.key
    body = obj.get()['Body'].read()
    temp = pd.read_csv(io.BytesIO(body), encoding='utf8')
    prefix_df.append(temp)

df = pd.concat(prefix_df, ignore_index=True)

Yash M answered Nov 20 '22


You can do it like this, using filter() instead of all():

for obj in bucket.objects.filter(Prefix='files/splittedfiles/'):
    key = obj.key
    body = obj.get()['Body'].read()
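
To restrict this loop to the selected Codes only, one sketch is to test each key against a set of wanted codes. The helper below is hypothetical, and it assumes keys look like "files/splittedfiles/Code-345678/file.csv" (the exact layout under each Code folder isn't shown in the question):

```python
# Hypothetical helper: decide whether an object key belongs to one
# of the selected Codes. The "Code-XXXXXX" path-segment layout is
# an assumption based on the paths in the question.
WANTED_CODES = {"345678", "345679", "345682"}

def key_selected(key, wanted=WANTED_CODES):
    for part in key.split("/"):
        if part.startswith("Code-"):
            return part[len("Code-"):] in wanted
    return False
```

Inside the loop above, keys for which key_selected(key) is False would simply be skipped with continue.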

Jørgen Frøland answered Nov 20 '22