
How do I read a csv stored in S3 with csv.DictReader?

I have code that fetches an AWS S3 object. How do I read this StreamingBody with Python's csv.DictReader?

import boto3, csv

session = boto3.session.Session(aws_access_key_id=<>, aws_secret_access_key=<>, region_name=<>)
s3_resource = session.resource('s3')
s3_object = s3_resource.Object(<bucket>, <key>)
streaming_body = s3_object.get()['Body']

#csv.DictReader(???)

asked Feb 18 '17 07:02 by Jon



1 Answer

The code would be something like this:

import boto3
import csv

# get a handle on s3
s3 = boto3.resource(u's3')

# get a handle on the bucket that holds your file
bucket = s3.Bucket(u'bucket-name')

# get a handle on the object you want (i.e. your file)
obj = bucket.Object(key=u'test.csv')

# get the object
response = obj.get()

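# read the contents of the file and split it into a list of lines
# (use splitlines(), not split(): split() also breaks on spaces inside fields)
# pick ONE of the two variants below; the body can only be read once

# for python 2:
# lines = response[u'Body'].read().splitlines()

# for python 3 you need to decode the incoming bytes:
lines = response['Body'].read().decode('utf-8').splitlines()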

# now iterate over those lines
for row in csv.DictReader(lines):

    # each row is a dict keyed by the column names in the header line
    # do whatever you want with each row here
    print(row)

You can compact this a bit in actual code, but I tried to keep it step-by-step to show the object hierarchy with boto3.
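
As a rough sketch of what the compacted version might look like, assuming the whole object fits in memory: the helper below decodes the bytes and hands them to csv.DictReader through io.StringIO. The name rows_from_bytes and the sample data are hypothetical; the sample bytes stand in for what response['Body'].read() would return.

```python
import csv
import io


def rows_from_bytes(data, encoding="utf-8"):
    """Decode CSV bytes and return the rows as a list of dicts (hypothetical helper)."""
    text = data.decode(encoding)
    # newline="" leaves line endings untouched so the csv module can parse them itself
    return list(csv.DictReader(io.StringIO(text, newline="")))


# Stand-in for response['Body'].read(); note the space inside "New York"
sample = b"name,city\nJon,New York\ngary,San Jose\n"
rows = rows_from_bytes(sample)

for row in rows:
    print(row["name"], "->", row["city"])
```

With boto3 this would be called as rows_from_bytes(obj.get()['Body'].read()). Passing the text through io.StringIO rather than a pre-split list also keeps quoted fields that contain line breaks intact.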

Edit: Per your comment about avoiding reading the entire file into memory: I haven't run into that requirement, so I can't speak authoritatively, but I would try wrapping the stream to get a text file-like iterator. For example, you could use the codecs library to replace the csv parsing section above with something like:

import codecs  # needed for getreader
# decode the byte stream as UTF-8 while it is being read
for row in csv.DictReader(codecs.getreader('utf-8')(response[u'Body'])):
    print(row)
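
To try the streaming idea without touching S3, here is a minimal sketch that uses io.BytesIO as a stand-in for the StreamingBody. It assumes the body behaves like a binary file object with a read() method. The function name iter_csv_rows and the sample data are made up for illustration.

```python
import codecs
import csv
import io


def iter_csv_rows(byte_stream, encoding="utf-8"):
    """Yield one dict per CSV row from a binary file-like object (hypothetical helper)."""
    # The codecs reader pulls bytes from the stream in chunks and decodes them,
    # so the whole payload is not loaded up front
    text_stream = codecs.getreader(encoding)(byte_stream)
    for row in csv.DictReader(text_stream):
        yield row


# io.BytesIO stands in for response['Body'] here (an assumption for local testing)
fake_body = io.BytesIO(b"id,name\n1,Jon\n2,gary\n")
names = [row["name"] for row in iter_csv_rows(fake_body)]
print(names)
```

Because it is a generator, rows are produced as the loop asks for them; against a real object you would pass response['Body'] in place of fake_body.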

answered Oct 08 '22 05:10 by gary