 

How to write parquet file from pandas dataframe in S3 in python

I have a pandas DataFrame that I want to write to a Parquet file in S3. I need sample code for this. I tried to Google it, but I could not find a working sample.

Alexsander asked Nov 21 '18 16:11


People also ask

Can S3 store parquet?

Amazon S3 Inventory gives you a flat-file list of your objects and metadata. You can get the S3 inventory in CSV, ORC, or Parquet format.

How do I write pandas DataFrame to parquet?

The to_parquet() function writes a DataFrame to the binary Parquet format. Its first argument is a file path or root directory path; the latter is used as the root directory when writing a partitioned dataset.


1 Answer

For your reference, the following code works for me:

s3_url = 's3://bucket/folder/bucket.parquet.gzip'
df.to_parquet(s3_url, compression='gzip')

In order to use to_parquet, you need pyarrow or fastparquet to be installed. Also, make sure you have the correct information in your config and credentials files, located in the ~/.aws folder.

Edit: Additionally, s3fs is needed. See https://stackoverflow.com/a/54006942/1862909

Wai Kiat answered Sep 30 '22 13:09