Python - How to read CSV file retrieved from S3 bucket?

There's a CSV file in an S3 bucket that I want to parse and turn into a dictionary in Python. Using Boto3, I called s3.get_object(Bucket=<bucket_name>, Key=<key>), which returns a dictionary that includes a "Body": StreamingBody() key-value pair that apparently contains the data I want.

In my Python file I've added import csv, and the examples I see online for reading a CSV file all pass a file name, such as:

with open(<csv_file_name>, mode='r') as file:
    reader = csv.reader(file)

However, I'm not sure how to retrieve the CSV file name from the StreamingBody, if that's even possible. If not, is there a better way for me to read the CSV file in Python? Thanks!

Edit: Wanted to add that I'm doing this in AWS Lambda and there are documented issues with using pandas in Lambda, so this is why I wanted to use the csv library and not pandas.

Louis asked Oct 25 '17 22:10



2 Answers

csv.reader does not require a file. It can use anything that iterates through lines, including files and lists.
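As a small illustration of that point, the sketch below feeds csv.reader a plain list of strings, with no file involved. The sample data is made up for the example.

```python
import csv

# Made-up sample data: each list element is one line of CSV text.
lines = ["name,city\n", "Ada,London\n", "Linus,Helsinki\n"]

# csv.reader accepts any iterable that yields strings, one line at a time.
rows = list(csv.reader(lines))

for row in rows:
    print(row)
```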

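So you don't need a filename. Just pass the lines from response['Body'] directly into the reader. In Python 3, read() returns bytes and csv.reader expects strings, so decode the data first. One way to do that is:

lines = response['Body'].read().decode('utf-8').splitlines(True)
reader = csv.reader(lines)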
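If the file is large, reading the whole body into memory may not be desirable. A possible alternative, sketched here under the assumption that the body is a binary stream with a read() method (as StreamingBody is), is to wrap it with codecs.getreader so it is decoded incrementally. The helper name iter_csv_rows and the sample data are hypothetical, and io.BytesIO stands in for response['Body'] so the sketch runs without AWS.

```python
import codecs
import csv
import io


def iter_csv_rows(body, encoding="utf-8"):
    """Yield each CSV row as a dict from a binary stream with a read() method."""
    # codecs.getreader wraps the byte stream and decodes it as it is read.
    text_stream = codecs.getreader(encoding)(body)
    for row in csv.DictReader(text_stream):
        yield dict(row)


# io.BytesIO is a stand-in for response['Body'] (an assumption for this demo).
fake_body = io.BytesIO(b"name,age\nAda,36\nLinus,54\n")
rows = list(iter_csv_rows(fake_body))
print(rows)
```

With a real response, the call would be iter_csv_rows(response['Body']). csv.DictReader is used here because the question asks for dictionaries; csv.reader works the same way if plain lists are enough.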
Aaron Bentley answered Oct 17 '22 15:10


To retrieve and read a CSV file from an S3 bucket, you can use the following code:

import csv
import boto3
from django.conf import settings

bucket_name = "your-bucket-name"
file_name = "your-file-name-exists-in-that-bucket.csv"

s3 = boto3.resource('s3', aws_access_key_id=settings.AWS_ACCESS_KEY_ID,
                    aws_secret_access_key=settings.AWS_SECRET_ACCESS_KEY)

bucket = s3.Bucket(bucket_name)

obj = bucket.Object(key=file_name)

response = obj.get()
lines = response['Body'].read().decode('utf-8').splitlines(True)

reader = csv.DictReader(lines)
for row in reader:
    # 'csv_header_key1' and 'csv_header_key2' stand for column names from your CSV header row
    print(row['csv_header_key1'], row['csv_header_key2'])
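Since the question is about AWS Lambda rather than Django, a variant without the settings import may be closer to what is needed. The sketch below assumes a Boto3 S3 client is passed in and uses its get_object(Bucket=..., Key=...) call. The function name load_csv_rows is hypothetical, and io.StringIO is swapped in for splitlines so that the csv module handles the line endings itself.

```python
import csv
import io


def load_csv_rows(s3_client, bucket, key, encoding="utf-8"):
    """Fetch an object with get_object and return its CSV rows as dicts."""
    response = s3_client.get_object(Bucket=bucket, Key=key)
    # read() returns bytes, so decode before handing the text to csv.
    text = response["Body"].read().decode(encoding)
    # newline="" leaves line endings untouched for the csv module to parse.
    return [dict(row) for row in csv.DictReader(io.StringIO(text, newline=""))]
```

In a Lambda function this could be called as load_csv_rows(boto3.client('s3'), bucket_name, file_name). That assumes the function's execution role is allowed to read the object, in which case no access keys need to be passed explicitly.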
Chirag Kalal answered Oct 17 '22 17:10