
Reading a large CSV from an S3 bucket using Python pandas in AWS SageMaker

I'm trying to load a large CSV (~5 GB) from an S3 bucket into pandas.

Here is the code I tried with a small 1.4 KB CSV:

import boto3
import pandas as pd
from io import StringIO

client = boto3.client('s3')
obj = client.get_object(Bucket='grocery', Key='stores.csv')
body = obj['Body']
csv_string = body.read().decode('utf-8')  # reads the entire object into memory
df = pd.read_csv(StringIO(csv_string))

This works well for a small CSV, but it cannot handle my 5 GB file, presumably because the whole object is read into memory and decoded into a StringIO buffer before pandas ever parses it.

I also tried the following:

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='bucket', Key='key')
df = pd.read_csv(obj['Body'])  # pass the StreamingBody straight to pandas

but this raises the following error:

ValueError: Invalid file path or buffer object type: <class 'botocore.response.StreamingBody'>

Any help to resolve this error is much appreciated.
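(For context, a minimal sketch of one workaround, not from the original post: the type error can be sidestepped by wrapping the downloaded bytes in an in-memory buffer, but this still pulls the whole file into RAM, so it does not help with the 5 GB case.)

import io
import boto3
import pandas as pd

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='bucket', Key='key')

# BytesIO gives pandas a file-like object it accepts, at the cost of
# holding the entire file in memory.
df = pd.read_csv(io.BytesIO(obj['Body'].read()))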

asked Jan 05 '18 by Dileepa Jayakody


People also ask

How do I transfer data from S3 bucket to SageMaker?

Is there a way to upload data to S3 from SageMaker? One approach is to save the CSV to local storage on the SageMaker notebook instance and then use the S3 APIs via boto3 to upload the file as an S3 object; see the S3 documentation for upload_file().
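A minimal sketch of that upload path, using placeholder bucket and file names rather than anything from the original post:

import boto3

s3 = boto3.client('s3')
# upload_file(local_path, bucket_name, object_key)
s3.upload_file('stores.csv', 'sagemaker-dileepa', 'data/stores.csv')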


1 Answer

I know this is quite late, but here is an answer:

import boto3
import pandas as pd

bucket = 'sagemaker-dileepa'   # or whatever you called your bucket
data_key = 'data/stores.csv'   # where the file is within your bucket
data_location = 's3://{}/{}'.format(bucket, data_key)

# pandas can read directly from an s3:// URL (this requires the s3fs package)
df = pd.read_csv(data_location)
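For a file as large as 5 GB it may still be worth streaming the CSV in chunks rather than materializing one huge DataFrame. A minimal sketch, reusing the same placeholder bucket and key, and assuming the data can be processed chunk by chunk:

import pandas as pd

# Read ~100,000 rows at a time so only one chunk is in memory at once.
chunks = pd.read_csv('s3://sagemaker-dileepa/data/stores.csv', chunksize=100_000)

row_count = 0
for chunk in chunks:
    row_count += len(chunk)  # replace with whatever per-chunk processing you need

print(row_count)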
answered Sep 20 '22 by mish1818