
How to write a partitioned Parquet file using Pandas

I'm trying to write a Pandas dataframe to a partitioned Parquet file:

df.to_parquet('output.parquet', engine='pyarrow', partition_cols=['partone', 'partwo'])

TypeError: __cinit__() got an unexpected keyword argument 'partition_cols'

From the documentation I expected that partition_cols would be passed through as keyword arguments to the pyarrow library. How can a partitioned Parquet file be written to local disk using pandas?

asked Oct 22 '18 by Ivan


2 Answers

Pandas DataFrame.to_parquet is a thin wrapper over table = pa.Table.from_pandas(...) and pq.write_table(table, ...) (see pandas.parquet.py#L120), and pq.write_table does not support writing partitioned datasets. You should use pq.write_to_dataset instead.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# example data with the partition columns; substitute your own
df = pd.DataFrame({
    'partone': ['a', 'a', 'b'],
    'parttwo': [1, 2, 1],
    'value':   [10, 20, 30],
})
table = pa.Table.from_pandas(df)

pq.write_to_dataset(
    table,
    root_path='output.parquet',
    partition_cols=['partone', 'parttwo'],
)

For more info, see the pyarrow documentation.

In general, I would always use the PyArrow API directly when reading/writing Parquet files, since the Pandas wrapper is rather limited in what it can do.

answered Oct 06 '22 by ostrokach


First make sure that you have reasonably recent versions of pandas and pyarrow:

pyenv shell 3.8.2
python -m venv venv
source venv/bin/activate
pip install pandas pyarrow
pip freeze | grep pandas # pandas==1.2.3
pip freeze | grep pyarrow # pyarrow==3.0.0

Then you can use partition_cols to produce the partitioned Parquet files:

import pandas as pd

# example dataframe with 3 rows and columns year,month,day,value
df = pd.DataFrame(data={'year':  [2020, 2020, 2021],
                        'month': [1,12,2], 
                        'day':   [1,31,28], 
                        'value': [1000,2000,3000]})

df.to_parquet('./mydf', partition_cols=['year', 'month', 'day'])

This produces:

mydf/year=2020/month=1/day=1/6f0258e6c48a48dbb56cae0494adf659.parquet
mydf/year=2020/month=12/day=31/cf8a45116d8441668c3a397b816cd5f3.parquet
mydf/year=2021/month=2/day=28/7f9ba3f37cb9417a8689290d3f5f9e6e.parquet
answered Oct 05 '22 by RubenLaguna