I'm trying to load a large CSV (~5 GB) from an S3 bucket into pandas.
The following is the code I tried with a small CSV of 1.4 kB:
import boto3
import pandas as pd
from io import StringIO

client = boto3.client('s3')
obj = client.get_object(Bucket='grocery', Key='stores.csv')
body = obj['Body']
# Read the entire object into memory and decode it before parsing.
csv_string = body.read().decode('utf-8')
df = pd.read_csv(StringIO(csv_string))
This works well for a small CSV, but my requirement of loading a 5 GB CSV into a pandas DataFrame cannot be met this way, probably because the whole file is held in memory when it is read through StringIO.
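What I would like is a way to read the file without holding all of it in memory, e.g. by downloading it to local disk first and parsing it in chunks; a sketch of what I have in mind (paths and chunk size are placeholders):

import boto3
import pandas as pd

s3 = boto3.client('s3')
# Stream the object to local disk rather than into memory.
s3.download_file('grocery', 'stores.csv', '/tmp/stores.csv')
# Parse in chunks so only part of the file is in memory at a time.
for chunk in pd.read_csv('/tmp/stores.csv', chunksize=100000):
    process(chunk)  # hypothetical per-chunk processing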
I also tried the code below:
import boto3
import pandas as pd

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='bucket', Key='key')
df = pd.read_csv(obj['Body'])
but this gives the error below:
ValueError: Invalid file path or buffer object type: <class 'botocore.response.StreamingBody'>
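Wrapping the stream in io.BytesIO sidesteps the type check, but it still pulls the entire object body into memory, so it does not help with the 5 GB case:

from io import BytesIO

# BytesIO satisfies pandas' file-like check, but .read() still loads
# the whole object into memory first.
df = pd.read_csv(BytesIO(obj['Body'].read()))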
Any help to resolve this error is much appreciated.
Is there a way to upload the data to S3 from SageMaker? One way to solve this would be to save the CSV to local storage on the SageMaker notebook instance, and then use the S3 APIs via boto3 to upload the file as an S3 object. See the boto3 documentation for upload_file().
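A minimal sketch (the local path, bucket name, and key below are placeholders):

import boto3

s3 = boto3.client('s3')
# Upload a file from the notebook instance's local storage to S3.
s3.upload_file('/tmp/stores.csv', 'my-bucket', 'data/stores.csv')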
I know this is quite late but here is an answer:
import pandas as pd  # pandas resolves s3:// paths via the s3fs package

bucket = 'sagemaker-dileepa'   # Or whatever you called your bucket
data_key = 'data/stores.csv'   # Where the file is within your bucket
data_location = 's3://{}/{}'.format(bucket, data_key)
df = pd.read_csv(data_location)
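Note that pd.read_csv only understands s3:// paths when the s3fs package is installed. For the 5 GB file you can also pass chunksize so that only part of the file is in memory at a time (the chunk size here is arbitrary):

# Process the file chunk by chunk instead of materializing one DataFrame.
for chunk in pd.read_csv(data_location, chunksize=100000):
    handle(chunk)  # hypothetical per-chunk processing function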