I have a text file saved on S3 which is a tab-delimited table. I want to load it into pandas, but I cannot save it to disk first because I am running on a Heroku server. Here is what I have so far.
import io
import boto3
import os
import pandas as pd

os.environ["AWS_ACCESS_KEY_ID"] = "xxxxxxxx"
os.environ["AWS_SECRET_ACCESS_KEY"] = "xxxxxxxx"

s3_client = boto3.client('s3')
response = s3_client.get_object(Bucket="my_bucket", Key="filename.txt")
file = response["Body"]
pd.read_csv(file, header=14, delimiter="\t", low_memory=False)
the error is
OSError: Expected file path name or file-like object, got <class 'bytes'> type
How do I convert the response body into a format pandas will accept?
pd.read_csv(io.StringIO(file), header=14, delimiter="\t", low_memory=False)

returns

TypeError: initial_value must be str or None, not StreamingBody

pd.read_csv(io.BytesIO(file), header=14, delimiter="\t", low_memory=False)

returns

TypeError: 'StreamingBody' does not support the buffer interface
UPDATE - Using the following worked
file = response["Body"].read()
and
pd.read_csv(io.BytesIO(file), header=14, delimiter="\t", low_memory=False)
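For reference, here is the whole flow put together, a minimal sketch using the bucket name, key, and header row from the snippet above (the credentials are placeholders and could equally come from ~/.aws/credentials or an IAM role):

import io
import boto3
import pandas as pd

s3_client = boto3.client("s3")

# Fetch the object, read the streaming body into bytes, and wrap it in
# BytesIO so pandas receives a file-like object instead of raw bytes.
response = s3_client.get_object(Bucket="my_bucket", Key="filename.txt")
body = response["Body"].read()
df = pd.read_csv(io.BytesIO(body), header=14, delimiter="\t", low_memory=False)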
Reading objects without downloading them: similarly, if you want to upload and read back small pieces of textual data such as quotes, tweets, or news articles, you can do that using the S3 resource method put(). You can use a boto3 Session and bucket.copy() to copy files between S3 buckets; you need AWS credentials with the appropriate permissions for copy or move operations. You can also write a pandas DataFrame as CSV directly to S3 using df.to_csv(s3URI, storage_options=...), as sketched below.
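A rough sketch of that last point, assuming a recent pandas with storage_options support and s3fs installed; the bucket name and credential values are placeholders:

import pandas as pd

df = pd.DataFrame({"quote": ["example text"], "source": ["twitter"]})

# Write the DataFrame straight to S3 as CSV; storage_options is passed
# through to s3fs, so credentials can be supplied inline if needed.
df.to_csv(
    "s3://my_bucket/quotes.csv",
    index=False,
    storage_options={"key": "xxxxxxxx", "secret": "xxxxxxxx"},
)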
pandas uses boto for read_csv, so you should be able to:

import boto
data = pd.read_csv('s3://bucket....csv')
If you need boto3 because you are on Python 3.4+, you can:

import io
import boto3
import pandas as pd

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='bucket', Key='key')
df = pd.read_csv(io.BytesIO(obj['Body'].read()))
Since version 0.20.1, pandas uses s3fs; see the answer below.
Now pandas can handle S3 URLs. You could simply do:
import pandas as pd
import s3fs

df = pd.read_csv('s3://bucket-name/file.csv')
You need to install s3fs if you don't have it:

pip install s3fs
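Applied to the tab-delimited file from the question, this becomes (bucket name, key, and header row copied from the original call):

import pandas as pd

# s3fs handles the download behind the scenes; pandas only sees a URL.
df = pd.read_csv(
    "s3://my_bucket/filename.txt",
    sep="\t",
    header=14,
    low_memory=False,
)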
If your S3 bucket is private and requires authentication, you have two options:
1- Add access credentials to your ~/.aws/credentials config file:

[default]
aws_access_key_id=AKIAIOSFODNN7EXAMPLE
aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Or
2- Set the following environment variables with their proper values:
AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
AWS_SESSION_TOKEN
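A further option, not covered in the original answer, is to pass the credentials inline through storage_options, which a recent pandas forwards to s3fs; a sketch with placeholder key values:

import pandas as pd

# Credentials are handed to s3fs directly instead of being read from the
# environment or ~/.aws/credentials.
df = pd.read_csv(
    "s3://bucket-name/file.csv",
    storage_options={
        "key": "AKIAIOSFODNN7EXAMPLE",
        "secret": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
    },
)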