I'm trying to write a Pandas dataframe to a partitioned file:
df.to_parquet('output.parquet', engine='pyarrow', partition_cols=['partone', 'partwo'])
TypeError: __cinit__() got an unexpected keyword argument 'partition_cols'
From the documentation I expected that partition_cols would be passed as kwargs to the pyarrow library. How can a partitioned file be written to local disk using pandas?
An ORC or Parquet file contains data columns. At write time you can additionally specify partition columns: the writer divides the rows into groups (partitions) based on the values of those columns and records each value in the directory path, so the data files themselves do not store the partition column values.
Yes, it is quite possible to write a pandas dataframe to the binary Parquet format; an additional library such as pyarrow or fastparquet is required. Pandas' to_parquet() function writes the DataFrame to Parquet, and its first argument is a file path, which is used as the root directory path when writing a partitioned dataset.
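To make the last point concrete, here is a minimal sketch, assuming a dataset laid out like the output.parquet example further down (the leaf file name here is hypothetical; pyarrow actually generates GUID-based names):

import pandas as pd

# Hypothetical leaf file inside a partitioned dataset: the values of
# 'partone' and 'parttwo' live in the directory names, not in the file.
leaf = pd.read_parquet('output.parquet/partone=a/parttwo=x/part-0.parquet')
print(leaf.columns.tolist())  # only the data columns remain, e.g. ['value']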
Pandas DataFrame.to_parquet is a thin wrapper over table = pa.Table.from_pandas(...) and pq.write_table(table, ...) (see pandas.parquet.py#L120), and pq.write_table does not support writing partitioned datasets. You should use pq.write_to_dataset instead.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame(yourData)  # yourData: whatever data you are starting from
table = pa.Table.from_pandas(df)

# root_path is treated as a directory: one subdirectory per partition value
pq.write_to_dataset(
    table,
    root_path='output.parquet',
    partition_cols=['partone', 'parttwo'],
)
For more info, see the pyarrow documentation.
In general, I would always use the PyArrow API directly when reading / writing parquet files, since the Pandas wrapper is rather limited in what it can do.
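Reading the partitioned dataset back is just as direct with the PyArrow API. A minimal sketch, reusing the output.parquet dataset and column names from the example above:

import pyarrow.parquet as pq

# Read the whole partitioned dataset; the partition columns are
# reconstructed from the directory names.
table = pq.read_table('output.parquet')

# Or read just one slice by filtering on a partition column.
subset = pq.read_table('output.parquet', filters=[('partone', '=', 'a')])
df = subset.to_pandas()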
First make sure that you have a reasonably recent version of pandas and pyarrow:
pyenv shell 3.8.2
python -m venv venv
source venv/bin/activate
pip install pandas pyarrow
pip freeze | grep pandas # pandas==1.2.3
pip freeze | grep pyarrow # pyarrow==3.0.0
Then you can use partition_cols to produce the partitioned parquet files:
import pandas as pd

# example dataframe with 3 rows and columns year, month, day, value
df = pd.DataFrame(data={'year': [2020, 2020, 2021],
                        'month': [1, 12, 2],
                        'day': [1, 31, 28],
                        'value': [1000, 2000, 3000]})
df.to_parquet('./mydf', partition_cols=['year', 'month', 'day'])
This produces:
mydf/year=2020/month=1/day=1/6f0258e6c48a48dbb56cae0494adf659.parquet
mydf/year=2020/month=12/day=31/cf8a45116d8441668c3a397b816cd5f3.parquet
mydf/year=2021/month=2/day=28/7f9ba3f37cb9417a8689290d3f5f9e6e.parquet
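Reading the dataset back also works through pandas. A minimal sketch, assuming the ./mydf directory written above:

import pandas as pd

# pd.read_parquet accepts the root directory of a partitioned dataset;
# year, month and day are reconstructed from the directory names.
df = pd.read_parquet('./mydf')
print(df)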