I have a text file saved on S3 which is a tab-delimited table. I want to load it into pandas, but I cannot save it to disk first because I am running on a Heroku server. Here is what I have so far:
import io
import boto3
import os
import pandas as pd

os.environ["AWS_ACCESS_KEY_ID"] = "xxxxxxxx"
os.environ["AWS_SECRET_ACCESS_KEY"] = "xxxxxxxx"

s3_client = boto3.client('s3')
response = s3_client.get_object(Bucket="my_bucket", Key="filename.txt")
file = response["Body"]
pd.read_csv(file, header=14, delimiter="\t", low_memory=False)

The error is:
OSError: Expected file path name or file-like object, got <class 'bytes'> type

How do I convert the response body into a format pandas will accept?
pd.read_csv(io.StringIO(file), header=14, delimiter="\t", low_memory=False)

returns

TypeError: initial_value must be str or None, not StreamingBody

pd.read_csv(io.BytesIO(file), header=14, delimiter="\t", low_memory=False)

returns

TypeError: 'StreamingBody' does not support the buffer interface

UPDATE - Using the following worked:
file = response["Body"].read()

and
pd.read_csv(io.BytesIO(file), header=14, delimiter="\t", low_memory=False)
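Putting those pieces together, a minimal end-to-end sketch of the working approach (the bucket name, key, and header row are the placeholders from above):

import io
import boto3
import pandas as pd

s3_client = boto3.client('s3')
response = s3_client.get_object(Bucket="my_bucket", Key="filename.txt")

# .read() pulls the StreamingBody into bytes, which BytesIO wraps as a file-like object
raw = response["Body"].read()
df = pd.read_csv(io.BytesIO(raw), header=14, delimiter="\t", low_memory=False)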
Reading objects without downloading them

Similarly, if you want to upload and read small pieces of textual data such as quotes, tweets, or news articles, you can do that using the S3 resource method put(), as demonstrated in the example below.
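A minimal sketch of that pattern, assuming a bucket named my_bucket and an illustrative key quotes/quote1.txt:

import boto3

s3 = boto3.resource('s3')
obj = s3.Object('my_bucket', 'quotes/quote1.txt')

# Upload a small piece of text directly from memory
obj.put(Body='An investment in knowledge pays the best interest.'.encode('utf-8'))

# Read it back without writing anything to local disk
text = obj.get()['Body'].read().decode('utf-8')
print(text)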
You can use a Boto3 Session and the bucket.copy() method to copy files between S3 buckets. You need your AWS account credentials to perform copy or move operations.
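For example, a rough sketch of copying an object between two hypothetical buckets (and deleting the source afterwards to emulate a move):

import boto3

session = boto3.Session()  # picks up credentials from env vars or ~/.aws/credentials
s3 = session.resource('s3')

copy_source = {'Bucket': 'source-bucket', 'Key': 'filename.txt'}
s3.Bucket('destination-bucket').copy(copy_source, 'filename.txt')

# To "move" the object instead, delete the original after copying
s3.Object('source-bucket', 'filename.txt').delete()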
You can write a pandas DataFrame as CSV directly to S3 using df.to_csv(s3URI, storage_options).
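A short sketch, assuming pandas >= 1.2 (which added storage_options) and s3fs installed, with placeholder credentials and bucket name:

import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': ['a', 'b']})

# storage_options is forwarded to s3fs; omit it if credentials are already configured
df.to_csv(
    's3://my_bucket/output.csv',
    index=False,
    storage_options={'key': 'xxxxxxxx', 'secret': 'xxxxxxxx'},
)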
pandas uses boto for read_csv, so you should be able to:
import boto
data = pd.read_csv('s3://bucket....csv')

If you need boto3 because you are on python3.4+, you can:
import boto3
import io

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='bucket', Key='key')
df = pd.read_csv(io.BytesIO(obj['Body'].read()))

Since version 0.20.1 pandas uses s3fs, see answer below.
Now pandas can handle S3 URLs. You could simply do:
import pandas as pd
import s3fs

df = pd.read_csv('s3://bucket-name/file.csv')

You need to install s3fs if you don't have it:

pip install s3fs
If your S3 bucket is private and requires authentication, you have two options:
1- Add access credentials to your ~/.aws/credentials config file
[default]
aws_access_key_id=AKIAIOSFODNN7EXAMPLE
aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

Or
2- Set the following environment variables with their proper values:
AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
AWS_SESSION_TOKEN
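For instance, with option 2 the credentials can be exported before starting Python (or set via os.environ at the top of a script), and pandas/s3fs will pick them up automatically. A sketch with placeholder values:

import os
import pandas as pd

# Placeholder credentials for illustration; real values come from your AWS account
os.environ['AWS_ACCESS_KEY_ID'] = 'xxxxxxxx'
os.environ['AWS_SECRET_ACCESS_KEY'] = 'xxxxxxxx'
# os.environ['AWS_SESSION_TOKEN'] = 'xxxxxxxx'  # only needed for temporary credentials

df = pd.read_csv('s3://bucket-name/file.csv')  # s3fs reads the credentials from the environment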