
How to import a text file on AWS S3 into pandas without writing to disk

I have a text file saved on S3 which is a tab-delimited table. I want to load it into pandas but cannot save it to disk first because I am running on a Heroku server. Here is what I have so far.

import io
import boto3
import os
import pandas as pd

os.environ["AWS_ACCESS_KEY_ID"] = "xxxxxxxx"
os.environ["AWS_SECRET_ACCESS_KEY"] = "xxxxxxxx"

s3_client = boto3.client('s3')
response = s3_client.get_object(Bucket="my_bucket", Key="filename.txt")
file = response["Body"]

pd.read_csv(file, header=14, delimiter="\t", low_memory=False)

The error is:

OSError: Expected file path name or file-like object, got <class 'bytes'> type 

How do I convert the response body into a format pandas will accept?

pd.read_csv(io.StringIO(file), header=14, delimiter="\t", low_memory=False)

returns

TypeError: initial_value must be str or None, not StreamingBody

pd.read_csv(io.BytesIO(file), header=14, delimiter="\t", low_memory=False)

returns

TypeError: 'StreamingBody' does not support the buffer interface

UPDATE - Using the following worked

file = response["Body"].read() 

and

pd.read_csv(io.BytesIO(file), header=14, delimiter="\t", low_memory=False) 
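
The reason the update works can be reproduced offline. The sketch below assumes a hypothetical `FakeStreamingBody` class standing in for the S3 response body (it only mimics the `.read()` method), and made-up file contents with 14 preamble lines before the header row. It is an illustration of the idea, not the actual boto3 object.

```python
import io

import pandas as pd


class FakeStreamingBody:
    """Hypothetical stand-in for response["Body"]: it has .read() but is not bytes-like."""

    def __init__(self, payload):
        self._payload = payload

    def read(self):
        return self._payload


# Made-up file: 14 preamble lines, then a tab-delimited header row and two data rows.
preamble = "".join(f"# preamble line {i}\n" for i in range(14))
payload = (preamble + "x\ty\n1\t2\n3\t4\n").encode("utf-8")
body = FakeStreamingBody(payload)

# Wrapping the body object itself fails, because BytesIO needs actual bytes.
try:
    io.BytesIO(body)
    wrapped_directly = True
except TypeError:
    wrapped_directly = False

# Calling .read() first yields bytes, which BytesIO turns into a file-like object.
raw = body.read()
df = pd.read_csv(io.BytesIO(raw), header=14, delimiter="\t", low_memory=False)
```

The whole object is held in memory here, so this suits files that comfortably fit in RAM.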
asked Jun 08 '16 13:06 by alpalalpal



2 Answers

pandas uses boto for read_csv, so you should be able to:

import boto
import pandas as pd

data = pd.read_csv('s3://bucket....csv')

If you need boto3 because you are on Python 3.4+, you can do:

import io

import boto3
import pandas as pd

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='bucket', Key='key')
df = pd.read_csv(io.BytesIO(obj['Body'].read()))

Since version 0.20.1, pandas uses s3fs for S3 access; see the answer below.
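
The boto3 approach above could be wrapped in a small helper, roughly as sketched here. The names `read_s3_table` and `FakeS3Client` are made up for illustration; the fake client only exists so the example runs without AWS access, and in real use you would pass `boto3.client('s3')` instead.

```python
import io

import pandas as pd


def read_s3_table(client, bucket, key, **read_csv_kwargs):
    """Load an S3 object into a DataFrame without writing to local disk.

    `client` is anything exposing get_object(Bucket=..., Key=...),
    such as the object returned by boto3.client('s3').
    """
    response = client.get_object(Bucket=bucket, Key=key)
    raw = response["Body"].read()  # read the whole body into memory as bytes
    return pd.read_csv(io.BytesIO(raw), **read_csv_kwargs)


class FakeS3Client:
    """Hypothetical in-memory stand-in for an S3 client, for offline runs only."""

    def __init__(self, objects):
        self._objects = objects

    def get_object(self, Bucket, Key):
        # io.BytesIO has a .read() method, which is all the helper relies on.
        return {"Body": io.BytesIO(self._objects[(Bucket, Key)])}


fake = FakeS3Client({("bucket", "key"): b"a\tb\n1\t2\n3\t4\n"})
df = read_s3_table(fake, "bucket", "key", delimiter="\t")
```

Passing the client in as an argument keeps credentials and region handling outside the helper and makes it easy to exercise without network access.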

answered Oct 25 '22 12:10 by Stefan


Now pandas can handle S3 URLs. You could simply do:

import pandas as pd
import s3fs

df = pd.read_csv('s3://bucket-name/file.csv')

You need to install s3fs if you don't have it:

pip install s3fs

Authentication

If your S3 bucket is private and requires authentication, you have two options:

1- Add access credentials to your ~/.aws/credentials config file:

[default]
aws_access_key_id=AKIAIOSFODNN7EXAMPLE
aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

Or

2- Set the following environment variables with their proper values:

  • AWS_ACCESS_KEY_ID
  • AWS_SECRET_ACCESS_KEY
  • AWS_SESSION_TOKEN (only needed for temporary credentials)
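
As a third possibility not covered above, credentials can also be handed to pandas explicitly through the `storage_options` argument of `read_csv` (available in newer pandas versions), which forwards them to s3fs. The helper name `s3_storage_options` below is made up for this sketch; the `key`, `secret` and `token` entries are the s3fs parameter names, and the credential values are placeholders.

```python
import os


def s3_storage_options(env=None):
    """Build a storage_options dict for pandas/s3fs from AWS environment variables.

    Hypothetical helper: unset variables are skipped, so the session
    token is optional.
    """
    env = os.environ if env is None else env
    options = {
        "key": env.get("AWS_ACCESS_KEY_ID"),
        "secret": env.get("AWS_SECRET_ACCESS_KEY"),
        "token": env.get("AWS_SESSION_TOKEN"),
    }
    return {name: value for name, value in options.items() if value}


# Placeholder values; a real run would read from os.environ instead.
opts = s3_storage_options(
    {"AWS_ACCESS_KEY_ID": "xxxxxxxx", "AWS_SECRET_ACCESS_KEY": "yyyyyyyy"}
)

# With s3fs installed and real credentials, the call would look like:
# df = pd.read_csv("s3://bucket-name/file.csv", storage_options=opts)
```

In most setups this is unnecessary, since the credentials file or environment variables are picked up automatically; explicit options are mainly useful when one process needs different credentials for different buckets.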
answered Oct 25 '22 14:10 by Sam