 

Reading multiple csv files from S3 bucket with boto3

I need to read multiple csv files from an S3 bucket with boto3 in Python and finally combine them into a single pandas DataFrame.

I am able to read a single file with the following script in Python:

import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('test-bucket')
for obj in bucket.objects.all():
    key = obj.key
    body = obj.get()['Body'].read()

Following is my path:

 files/splittedfiles/Code-345678

Under Code-345678 I have multiple csv files which I have to read and combine into a single pandas DataFrame.

Also, how do I pass a list of selected Codes, so that it reads those folders only? e.g.

files/splittedfiles/Code-345678
files/splittedfiles/Code-345679
files/splittedfiles/Code-345680
files/splittedfiles/Code-345681
files/splittedfiles/Code-345682

From the above I need to read files under the following codes only:

345678,345679,345682

How can I do it in python?

Neil asked Oct 17 '18


3 Answers

The boto3 API does not support reading multiple objects at once. What you can do instead is retrieve all objects that share a given prefix and load each of them in a loop. To do this, use the filter() method and set the Prefix parameter to the prefix of the objects you want to load. Below is a small change to your code that fetches every object under the prefix "files/splittedfiles/Code-345678"; looping over those objects lets you load each file into a DataFrame:

import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('test-bucket')
prefix_objs = bucket.objects.filter(Prefix="files/splittedfiles/Code-345678")
for obj in prefix_objs:
    key = obj.key
    body = obj.get()['Body'].read()

If you have multiple prefixes to evaluate, you can turn the above into a function that takes the prefix as a parameter, then combine the results. The function could look something like this:

import io

import boto3
import pandas as pd

def read_prefix_to_df(prefix):
    s3 = boto3.resource('s3')
    bucket = s3.Bucket('test-bucket')
    prefix_objs = bucket.objects.filter(Prefix=prefix)
    prefix_df = []
    for obj in prefix_objs:
        body = obj.get()['Body'].read()
        # Parse the raw bytes as csv; passing them straight to
        # pd.DataFrame(body) raises "DataFrame constructor not
        # properly called!"
        df = pd.read_csv(io.BytesIO(body), encoding='utf8')
        prefix_df.append(df)
    return pd.concat(prefix_df)

Then you can iteratively apply this function to each prefix and combine the results in the end.
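For example, the selected codes from the question could be combined like this. Here read_prefix_to_df is replaced by a stub so the combining step runs without S3 access; the "Code-" prefix layout is taken from the question's paths:

```python
import pandas as pd

# Stub standing in for read_prefix_to_df above, so the combining
# step can be shown without an S3 connection; it returns one row
# tagged with the Code extracted from the prefix.
def read_prefix_to_df(prefix):
    return pd.DataFrame({"code": [prefix.rsplit("-", 1)[-1]], "value": [1]})

codes = [345678, 345679, 345682]  # the codes selected in the question
prefixes = [f"files/splittedfiles/Code-{code}" for code in codes]
combined = pd.concat([read_prefix_to_df(p) for p in prefixes], ignore_index=True)
```

With the real function in place of the stub, combined would hold all rows from every csv file under the selected prefixes.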

vielkind answered Nov 20 '22


Modifying Answer 1 to overcome the error "DataFrame constructor not properly called!":

Code:

import boto3
import pandas as pd
import io

s3 = boto3.resource('s3')
bucket = s3.Bucket('bucket_name')
prefix_objs = bucket.objects.filter(Prefix="folder_path/prefix")

prefix_df = []

for obj in prefix_objs:
    key = obj.key
    body = obj.get()['Body'].read()
    temp = pd.read_csv(io.BytesIO(body), encoding='utf8')
    prefix_df.append(temp)

df = pd.concat(prefix_df, ignore_index=True)

Yash M answered Nov 20 '22


You can do it like this, using filter() instead of all():

for obj in bucket.objects.filter(Prefix='files/splittedfiles/'):
    key = obj.key
    body = obj.get()['Body'].read()
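
To restrict this loop to the selected Codes only, one sketch is to test each key against a set of wanted codes. The helper below is hypothetical, and it assumes keys look like "files/splittedfiles/Code-345678/file.csv" (the exact layout under each Code folder isn't shown in the question):

```python
# Hypothetical helper: decide whether an object key belongs to one
# of the selected Codes. The "Code-XXXXXX" path-segment layout is
# an assumption based on the paths in the question.
WANTED_CODES = {"345678", "345679", "345682"}

def key_selected(key, wanted=WANTED_CODES):
    for part in key.split("/"):
        if part.startswith("Code-"):
            return part[len("Code-"):] in wanted
    return False
```

Inside the loop above, keys for which key_selected(key) is False would simply be skipped with continue.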

Jørgen Frøland answered Nov 20 '22