
How to write a partitioned Parquet file using Pandas

I'm trying to write a Pandas dataframe to a partitioned Parquet file:

df.to_parquet('output.parquet', engine='pyarrow', partition_cols=['partone', 'partwo'])

TypeError: __cinit__() got an unexpected keyword argument 'partition_cols'

From the documentation I expected that partition_cols would be passed through as keyword arguments to the pyarrow library. How can a partitioned Parquet file be written to local disk using pandas?

asked Oct 22 '18 by Ivan


2 Answers

Pandas DataFrame.to_parquet is a thin wrapper over table = pa.Table.from_pandas(...) and pq.write_table(table, ...) (see pandas.parquet.py#L120), and pq.write_table does not support writing partitioned datasets. You should use pq.write_to_dataset instead.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# example data with the partition columns; substitute your own
df = pd.DataFrame({
    'partone': ['a', 'a', 'b'],
    'parttwo': [1, 2, 1],
    'value':   [10, 20, 30],
})
table = pa.Table.from_pandas(df)

pq.write_to_dataset(
    table,
    root_path='output.parquet',
    partition_cols=['partone', 'parttwo'],
)

For more info, see the pyarrow documentation.

In general, I would always use the PyArrow API directly when reading/writing Parquet files, since the Pandas wrapper is rather limited in what it can do.

answered Oct 06 '22 by ostrokach


First make sure that you have reasonably recent versions of pandas and pyarrow:

pyenv shell 3.8.2
python -m venv venv
source venv/bin/activate
pip install pandas pyarrow
pip freeze | grep pandas # pandas==1.2.3
pip freeze | grep pyarrow # pyarrow==3.0.0

Then you can use partition_cols to produce the partitioned Parquet files:

import pandas as pd

# example dataframe with 3 rows and columns year,month,day,value
df = pd.DataFrame(data={'year':  [2020, 2020, 2021],
                        'month': [1,12,2], 
                        'day':   [1,31,28], 
                        'value': [1000,2000,3000]})

df.to_parquet('./mydf', partition_cols=['year', 'month', 'day'])

This produces:

mydf/year=2020/month=1/day=1/6f0258e6c48a48dbb56cae0494adf659.parquet
mydf/year=2020/month=12/day=31/cf8a45116d8441668c3a397b816cd5f3.parquet
mydf/year=2021/month=2/day=28/7f9ba3f37cb9417a8689290d3f5f9e6e.parquet
answered Oct 05 '22 by RubenLaguna