
How to read a parquet file from S3 using dask with a specific AWS profile

How do I read a parquet file on S3 using dask and a specific AWS profile (stored in a credentials file)? Dask uses s3fs, which uses boto3. This is what I have tried:

>>>import os
>>>import s3fs
>>>import boto3
>>>import dask.dataframe as dd

>>>os.environ['AWS_SHARED_CREDENTIALS_FILE'] = "~/.aws/credentials"

>>>fs = s3fs.S3FileSystem(anon=False, profile_name="some_user_profile")
>>>fs.exists("s3://some.bucket/data/parquet/somefile")
True
>>>df = dd.read_parquet('s3://some.bucket/data/parquet/somefile')
NoCredentialsError: Unable to locate credentials
asked Jan 22 '18 by muon


People also ask

How do I read AWS Parquet files?

For an introduction to the format from the standard authority, see the Apache Parquet Documentation Overview. You can use AWS Glue to read Parquet files from Amazon S3 and from streaming sources, as well as write Parquet files to Amazon S3. You can read and write bzip and gzip archives containing Parquet files from S3.
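
For context (not part of the original question or answer), here is a minimal sketch of what reading and writing S3 Parquet data in an AWS Glue job script might look like. The bucket prefixes are placeholders, and the awsglue library is only available inside a Glue job environment:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

# GlueContext wraps a SparkContext; this only works inside an AWS Glue job.
glue_context = GlueContext(SparkContext.getOrCreate())

# Read all Parquet files under a (placeholder) S3 prefix into a DynamicFrame.
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://some.bucket/data/parquet/"]},
    format="parquet",
)

# Write the same data back to S3 as Parquet under another placeholder prefix.
glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://some.bucket/data/parquet_out/"},
    format="parquet",
)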

Does S3 support Parquet?

You can also get Amazon S3 inventory reports in Parquet or ORC format. Amazon S3 inventory gives you a flat-file list of your objects and metadata. You can get the S3 inventory in CSV, ORC, or Parquet format.


1 Answer

Never mind, that was easy, but I did not find any reference online, so here it is:

>>>import os
>>>import dask.dataframe as dd
>>>os.environ['AWS_SHARED_CREDENTIALS_FILE'] = "/path/to/credentials"

>>>df = dd.read_parquet('s3://some.bucket/data/parquet/somefile',
                      storage_options={"profile_name":"some_user_profile"})
>>>df.head()
# works
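
A follow-up note (mine, not from the original answer): everything in storage_options is forwarded to the s3fs.S3FileSystem constructor, so other connection settings can be passed the same way. Newer s3fs releases use profile instead of profile_name; the sketch below assumes a recent s3fs, and the region shown is just an illustrative extra option:

import dask.dataframe as dd

# storage_options is handed straight to s3fs.S3FileSystem(...), so any of its
# constructor arguments can go here. Recent s3fs versions use "profile"
# (older ones accepted "profile_name").
df = dd.read_parquet(
    "s3://some.bucket/data/parquet/somefile",
    storage_options={
        "profile": "some_user_profile",                 # named profile from the shared credentials file
        "client_kwargs": {"region_name": "us-east-1"},  # illustrative extra boto3 client option
    },
)
print(df.head())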
answered Oct 19 '22 by muon