I have a tab-delimited text file stored on S3. I want to load it into pandas, but I cannot save it to disk first because I am running on a Heroku server. Here is what I have so far:
<pre class="prettyprint"><code>import io
import boto3
import os
import pandas as pd

os.environ["AWS_ACCESS_KEY_ID"] = "xxxxxxxx"
os.environ["AWS_SECRET_ACCESS_KEY"] = "xxxxxxxx"

s3_client = boto3.client('s3')
response = s3_client.get_object(Bucket="my_bucket", Key="filename.txt")
file = response["Body"]

pd.read_csv(file, header=14, delimiter="\t", low_memory=False)
</code></pre>
The error is:
<pre class="prettyprint"><code>OSError: Expected file path name or file-like object, got &lt;class 'bytes'&gt; type
</code></pre>
How do I convert the response body into a format pandas will accept? Wrapping the body in <code>io.StringIO</code> or <code>io.BytesIO</code> directly also fails:
<pre class="prettyprint"><code>pd.read_csv(io.StringIO(file), header=14, delimiter="\t", low_memory=False)
# TypeError: initial_value must be str or None, not StreamingBody

pd.read_csv(io.BytesIO(file), header=14, delimiter="\t", low_memory=False)
# TypeError: 'StreamingBody' does not support the buffer interface
</code></pre>
UPDATE: The following worked. First read the body into bytes:
<pre class="prettyprint"><code>file = response["Body"].read()
</code></pre>
and then wrap those bytes in <code>io.BytesIO</code>:
<pre class="prettyprint"><code>pd.read_csv(io.BytesIO(file), header=14, delimiter="\t", low_memory=False)
</code></pre>
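The fix in the update works because the <code>Body</code> of the response is a <code>StreamingBody</code> object rather than <code>str</code> or <code>bytes</code>, so <code>io.StringIO</code> and <code>io.BytesIO</code> cannot wrap it directly; calling <code>.read()</code> first yields plain bytes. Below is a minimal sketch of that conversion. <code>FakeStreamingBody</code>, <code>body_to_buffer</code> and <code>load_tsv</code> are hypothetical names introduced here for illustration, and the stand-in class only mimics the <code>read()</code> method so the snippet runs without AWS access.

```python
import io


class FakeStreamingBody:
    """Hypothetical stand-in for the S3 response body: exposes read() only."""

    def __init__(self, payload: bytes):
        self._payload = payload

    def read(self) -> bytes:
        return self._payload


def body_to_buffer(body) -> io.BytesIO:
    """Read the whole body into memory and wrap the bytes in a file-like buffer."""
    return io.BytesIO(body.read())


def load_tsv(body, header_row=0):
    """Parse a tab-delimited body into a DataFrame without writing to disk."""
    import pandas as pd  # imported here so body_to_buffer stays dependency-free

    return pd.read_csv(body_to_buffer(body), header=header_row, delimiter="\t")


body = FakeStreamingBody(b"name\tvalue\nalpha\t1\nbeta\t2\n")

# Wrapping the body object itself fails, much like the attempts in the question.
try:
    io.BytesIO(body)
    wrapped_directly = True
except TypeError:
    wrapped_directly = False

# Wrapping the bytes returned by read() gives pandas something file-like.
buffer = body_to_buffer(body)
first_line = buffer.readline()
```

With a real response this would be called as <code>load_tsv(response["Body"], header_row=14)</code>. Note that <code>read()</code> holds the entire object in memory, which is the price of not writing to disk.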

Older versions of <code>pandas</code> use <code>boto</code> for <code>read_csv</code>, so you should be able to read directly from an S3 URL:
<pre class="prettyprint"><code>import boto
import pandas as pd

data = pd.read_csv('s3://bucket....csv')
</code></pre>
If you need <code>boto3</code> because you are on <code>python3.4+</code>, you can fetch the object yourself and wrap its bytes in a buffer:
<pre class="prettyprint"><code>import io

import boto3
import pandas as pd

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='bucket', Key='key')
# read() returns the whole object as bytes; BytesIO makes it file-like for pandas
df = pd.read_csv(io.BytesIO(obj['Body'].read()))
</code></pre>
Since version 0.20.1, <code>pandas</code> uses <code>s3fs</code>; see the answer below.
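The two snippets above address the same object in different ways: <code>read_csv</code> takes an <code>s3://</code> URL, while <code>get_object</code> takes a separate <code>Bucket</code> and <code>Key</code>. If you start from a URL but need the <code>boto3</code> route, a small helper can split it. This is only a sketch; <code>split_s3_url</code> is a made-up name, not part of pandas or boto3.

```python
def split_s3_url(url):
    """Split 's3://bucket/path/to/key' into a (bucket, key) tuple."""
    prefix = "s3://"
    if not url.startswith(prefix):
        raise ValueError(f"not an s3:// URL: {url!r}")
    # Everything up to the first slash is the bucket; the rest is the key.
    bucket, _, key = url[len(prefix):].partition("/")
    if not bucket or not key:
        raise ValueError(f"URL must contain both a bucket and a key: {url!r}")
    return bucket, key


bucket, key = split_s3_url("s3://bucket-name/path/to/file.csv")
```

The result can then be passed on as <code>s3.get_object(Bucket=bucket, Key=key)</code>.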

pandas can now handle S3 URLs directly. You can simply do:
<pre class="prettyprint"><code>import pandas as pd
import s3fs

df = pd.read_csv('s3://bucket-name/file.csv')
</code></pre>
You need to install <code>s3fs</code> if you don't have it: <code>pip install s3fs</code>
<h3>Authentication</h3>
If your S3 bucket is private and requires authentication, you have two options:

1- Add access credentials to your <code>~/.aws/credentials</code> config file:
<pre class="prettyprint"><code>[default]
aws_access_key_id=AKIAIOSFODNN7EXAMPLE
aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
</code></pre>
Or 2- Set the following environment variables with their proper values:
<ul>
<li><code>AWS_ACCESS_KEY_ID</code></li>
<li><code>AWS_SECRET_ACCESS_KEY</code></li>
<li><code>AWS_SESSION_TOKEN</code></li>
</ul>
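Beyond the two options above, credentials can also be handed to pandas explicitly. The sketch below assumes pandas 1.2 or later, where <code>read_csv</code> accepts a <code>storage_options</code> dict that is forwarded to <code>s3fs</code>, and it assumes the <code>key</code>, <code>secret</code> and <code>token</code> option names used by <code>s3fs</code>. The helper names <code>storage_options_from_env</code> and <code>read_private_csv</code> are hypothetical.

```python
import os


def storage_options_from_env(environ=os.environ):
    """Build s3fs-style credential options from the AWS environment variables."""
    options = {
        "key": environ.get("AWS_ACCESS_KEY_ID"),
        "secret": environ.get("AWS_SECRET_ACCESS_KEY"),
        "token": environ.get("AWS_SESSION_TOKEN"),
    }
    # Drop unset values so only the credentials that are present get passed on.
    return {name: value for name, value in options.items() if value}


def read_private_csv(url, environ=os.environ, **read_csv_kwargs):
    """Read an s3:// URL, passing credentials explicitly (assumes pandas >= 1.2)."""
    import pandas as pd  # imported here so the helper above stays dependency-free

    return pd.read_csv(
        url, storage_options=storage_options_from_env(environ), **read_csv_kwargs
    )


# Placeholder credentials taken from the example credentials file above.
example_env = {
    "AWS_ACCESS_KEY_ID": "AKIAIOSFODNN7EXAMPLE",
    "AWS_SECRET_ACCESS_KEY": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
}
example_options = storage_options_from_env(example_env)
```

A call would then look like <code>read_private_csv('s3://bucket-name/file.csv', delimiter="\t")</code>. Passing the mapping in as a parameter keeps secrets out of the source code, which is preferable to assigning them to <code>os.environ</code> inside the script.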

How to import a text file on AWS S3 into pandas without writing to disk

asked Jun 08 '16 13:06

alpalalpal

2 Answers

answered Oct 25 '22 12:10

Stefan

answered Oct 25 '22 14:10

Sam

Tags: python, pandas, heroku, amazon-s3, boto3
