I need to read multiple CSV files from an S3 bucket with boto3 in Python and combine them into a single pandas DataFrame. I am able to read a single file with the following script:
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('test-bucket')
for obj in bucket.objects.all():
    key = obj.key
    body = obj.get()['Body'].read()
My path looks like this:
files/splittedfiles/Code-345678
Under Code-345678 I have multiple CSV files which I have to read and combine into a single DataFrame in pandas.
Also, how do I pass a list of selected Codes, so that it reads only those folders? e.g.
files/splittedfiles/Code-345678
files/splittedfiles/Code-345679
files/splittedfiles/Code-345680
files/splittedfiles/Code-345681
files/splittedfiles/Code-345682
From the above I need to read the files under the following Codes only:
345678, 345679, 345682
How can I do it in Python?
The boto3 API does not support reading multiple objects at once. What you can do is retrieve all objects with a specified prefix and load each of the returned objects with a loop. To do this, use the filter() method and set the Prefix parameter to the prefix of the objects you want to load. Below is a simple change to your code that fetches all the objects with the prefix "files/splittedfiles/Code-345678"; you can then loop through them and load each file into a DataFrame:
s3 = boto3.resource('s3')
bucket = s3.Bucket('test-bucket')
prefix_objs = bucket.objects.filter(Prefix="files/splittedfiles/Code-345678")
for obj in prefix_objs:
    key = obj.key
    body = obj.get()['Body'].read()
If you have multiple prefixes to evaluate, you can take the above and turn it into a function where the prefix is a parameter, then combine the results. The function could look something like this:
import boto3
import pandas as pd

def read_prefix_to_df(prefix):
    s3 = boto3.resource('s3')
    bucket = s3.Bucket('test-bucket')
    prefix_objs = bucket.objects.filter(Prefix=prefix)
    prefix_df = []
    for obj in prefix_objs:
        key = obj.key
        body = obj.get()['Body'].read()
        # NOTE: passing raw bytes to the DataFrame constructor raises
        # "DataFrame constructor not properly called!"; see the fix in the next answer.
        df = pd.DataFrame(body)
        prefix_df.append(df)
    return pd.concat(prefix_df)
Then you can iteratively apply this function to each prefix and combine the results in the end.
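As a sketch of that last step (assuming the selected Codes from the question and the read_prefix_to_df function above):

codes = ["345678", "345679", "345682"]
# Build one prefix per Code, read each into a DataFrame, then combine them.
frames = [read_prefix_to_df("files/splittedfiles/Code-" + code) for code in codes]
combined_df = pd.concat(frames, ignore_index=True)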
Modifying Answer 1 to overcome the error "DataFrame constructor not properly called!":
Code:
import boto3
import pandas as pd
import io

s3 = boto3.resource('s3')
bucket = s3.Bucket('bucket_name')
prefix_objs = bucket.objects.filter(Prefix="folder_path/prefix")
prefix_df = []
for obj in prefix_objs:
    key = obj.key
    body = obj.get()['Body'].read()
    # Parse the CSV bytes with read_csv instead of the DataFrame constructor.
    temp = pd.read_csv(io.BytesIO(body), encoding='utf8')
    prefix_df.append(temp)
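After the loop you can combine the collected frames into one DataFrame, as in Answer 1:

df = pd.concat(prefix_df, ignore_index=True)

If the listing also returns the zero-byte "folder" key itself, guard the loop with something like if key.endswith('.csv') before calling pd.read_csv.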
Can you do it like this, using "filter" instead of "all":
for obj in bucket.objects.filter(Prefix='files/splittedfiles/'):
    key = obj.key
    body = obj.get()['Body'].read()
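If you only want the selected Codes rather than everything under files/splittedfiles/, one way (a sketch combining this answer with the read_csv fix above; the bucket name is taken from the question) is to call filter() once per Code:

import io
import boto3
import pandas as pd

s3 = boto3.resource('s3')
bucket = s3.Bucket('test-bucket')

codes = ['345678', '345679', '345682']
frames = []
for code in codes:
    # The trailing slash keeps e.g. Code-345678 from also matching Code-3456789.
    for obj in bucket.objects.filter(Prefix='files/splittedfiles/Code-%s/' % code):
        body = obj.get()['Body'].read()
        frames.append(pd.read_csv(io.BytesIO(body), encoding='utf8'))
df = pd.concat(frames, ignore_index=True)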