How do I read a Parquet file on S3 using Dask and a specific AWS profile (stored in a credentials file)? Dask uses s3fs, which uses boto3. This is what I have tried:
>>> import os
>>> import s3fs
>>> import boto3
>>> import dask.dataframe as dd
>>> os.environ['AWS_SHARED_CREDENTIALS_FILE'] = "~/.aws/credentials"
>>> fs = s3fs.S3FileSystem(anon=False, profile_name="some_user_profile")
>>> fs.exists("s3://some.bucket/data/parquet/somefile")
True
>>> df = dd.read_parquet('s3://some.bucket/data/parquet/somefile')
NoCredentialsError: Unable to locate credentials
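The exists() check succeeds because fs carries the profile, but dd.read_parquet builds its own S3FileSystem from storage_options and never sees that fs object, so it falls back to the default credential chain. One workaround, a sketch that is not from the original post and assumes botocore honours the standard AWS_PROFILE environment variable, is to select the profile through the environment:
>>> import os
>>> import dask.dataframe as dd
>>> # AWS_PROFILE is read by botocore, so s3fs should pick the profile up
>>> # without any explicit arguments; the profile name here is assumed
>>> os.environ['AWS_PROFILE'] = "some_user_profile"
>>> df = dd.read_parquet('s3://some.bucket/data/parquet/somefile')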
Never mind, that was easy, but I did not find any reference online, so here it is:
>>> import os
>>> import dask.dataframe as dd
>>> os.environ['AWS_SHARED_CREDENTIALS_FILE'] = "/path/to/credentials"
>>> df = dd.read_parquet('s3://some.bucket/data/parquet/somefile',
...                      storage_options={"profile_name": "some_user_profile"})
>>> df.head()
# works
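Note that on newer s3fs releases (0.5 and later, aiobotocore-based) the keyword appears to be profile rather than profile_name, so the same call may need to be spelled as below. This variant is an assumption based on the current s3fs API, not part of the original answer; check your installed version.
>>> # assumed spelling for newer s3fs: "profile" instead of "profile_name"
>>> df = dd.read_parquet('s3://some.bucket/data/parquet/somefile',
...                      storage_options={"profile": "some_user_profile"})
>>> df.head()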