Reading a file from a private S3 bucket to a pandas dataframe

I'm trying to read a CSV file from a private S3 bucket to a pandas dataframe:

df = pandas.read_csv('s3://mybucket/file.csv') 

I can read a file from a public bucket, but reading a file from a private bucket results in an HTTP 403 (Forbidden) error.

I have configured the AWS credentials using aws configure.

I can download a file from a private bucket using boto3, which picks up the AWS credentials. It seems that I need to configure pandas to use those credentials as well, but I don't know how. For reference, the boto3 download that does work looks something like the sketch below (bucket and object names are the ones from the example above):
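
import boto3

# boto3 picks up the credentials written by `aws configure`
# (typically from ~/.aws/credentials)
s3 = boto3.client('s3')
s3.download_file('mybucket', 'file.csv', 'file.csv')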

asked Mar 04 '16 by IgorK



2 Answers

Pandas uses boto (not boto3) inside read_csv. You might be able to install boto and have it work correctly.

There are some issues with boto on Python 3.4.4 / 3.5.1. If you're on one of those versions, and until those are fixed, you can use boto3 like so:

import boto3
import pandas as pd

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='bucket', Key='key')
df = pd.read_csv(obj['Body'])

The Body in that response is a file-like object with a .read method (returning bytes), which is enough for pandas.
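
If your pandas version balks at the streaming body, a variant that should also work (assuming the file fits in memory) is to read the bytes up front and wrap them in a buffer:

import io

# StreamingBody.read() returns the whole object as bytes;
# BytesIO gives pandas a seekable file-like object
body = obj['Body'].read()
df = pd.read_csv(io.BytesIO(body))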

answered Sep 28 '22 by TomAugspurger

Updated for Pandas 0.20.1

Pandas now uses s3fs to handle S3 connections; from the release notes:

pandas now uses s3fs for handling S3 connections. This shouldn’t break any code. However, since s3fs is not a required dependency, you will need to install it separately, like boto in prior versions of pandas.

import os

import pandas as pd
from s3fs.core import S3FileSystem

# AWS keys stored in an ini file in the same path;
# refer to the boto3 docs for config settings
os.environ['AWS_CONFIG_FILE'] = 'aws_config.ini'

s3 = S3FileSystem(anon=False)
key = 'path/to/your-csv.csv'  # use forward slashes in S3 keys
bucket = 'your-bucket-name'

df = pd.read_csv(s3.open('{}/{}'.format(bucket, key), mode='rb'))
# or with f-strings
df = pd.read_csv(s3.open(f'{bucket}/{key}', mode='rb'))
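
With s3fs installed, the one-liner from the question should also work as-is, since pandas hands the s3:// URL off to s3fs, which in turn picks up the credentials set up by aws configure:

import pandas as pd

# requires the s3fs package; credentials are read from ~/.aws/credentials
df = pd.read_csv('s3://mybucket/file.csv')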
answered Sep 28 '22 by spitfiredd