
Control row groups with pandas.DataFrame.to_parquet


To read a parquet file into multiple partitions, it should be stored using row groups (see How to read a single large parquet file into multiple partitions using dask/dask-cudf?). The pandas documentation describes partitioning of columns, and the pyarrow documentation describes how to write multiple row groups. Using the pandas DataFrame .to_parquet method, can I control the writing of multiple row groups, or will it always write everything into a single row group? If it is possible, how?

Although the dataset is small (currently only 3 GB), I want to read it into multiple partitions so that subsequent processing with dask uses multiple cores (I can repartition, but that creates additional overhead), and I might later work with datasets of some tens of GB, still small but too large for RAM.
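For context, the read side would look roughly like this (a minimal sketch; it assumes dask is installed and that split_row_groups=True makes dask map each row group to its own partition):

import dask.dataframe as dd

# each row group in the file becomes one dask partition
ddf = dd.read_parquet("filename.parquet", split_row_groups=True)
print(ddf.npartitions)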

asked Jan 29 '20 by gerrit


2 Answers

You can simply provide the keyword argument row_group_size when using pyarrow. Note that pyarrow is the default engine.

df.to_parquet("filename.parquet", row_group_size=500, engine="pyarrow")
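To check that the row groups were actually written, one way is to inspect the file with pyarrow (a minimal sketch; the 500-row groups below assume a DataFrame of a couple of thousand rows):

import pandas as pd
import pyarrow.parquet as pq

df = pd.DataFrame({"x": range(2000)})
df.to_parquet("filename.parquet", row_group_size=500, engine="pyarrow")

# inspect the file metadata: expect 4 row groups of 500 rows each
pf = pq.ParquetFile("filename.parquet")
print(pf.num_row_groups)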
answered Sep 30 '22 by JulianWgs


Alternative answer for folks using fastparquet instead of pyarrow: fastparquet provides the same functionality via a differently named parameter, row_group_offsets.

df.to_parquet("filename.parquet", row_group_offsets=500, engine='fastparquet')

From the documentation on row_group_offsets (int or list of int):

If int, row-groups will be approximately this many rows, rounded down to make row groups about the same size; If a list, the explicit index values to start new row groups; If None, set to 50_000_000. In case of partitioning the data, final row-group size can be reduced significantly further by the partitioning, occurring as a subsequent step.
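For finer control, the list form lets you pick the exact start index of each row group (a hedged sketch; the offsets below are arbitrary and assume a DataFrame with at least 2000 rows):

import pandas as pd

df = pd.DataFrame({"x": range(2000)})
# explicit start indices: row groups of 500, 500, and 1000 rows
df.to_parquet("filename.parquet", row_group_offsets=[0, 500, 1000], engine="fastparquet")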

answered Sep 30 '22 by Haleemur Ali