To read a parquet file into multiple partitions, it should be stored using row groups (see How to read a single large parquet file into multiple partitions using dask/dask-cudf?). The pandas documentation describes partitioning by columns, and the pyarrow documentation describes how to write multiple row groups. Using the pandas DataFrame .to_parquet method, can I use this ability to write multiple row groups, or will it always write to a single row group? If so, how?
Although the dataset is small (currently only 3 GB), I want to read it into multiple partitions so that subsequent processing with dask uses multiple cores (I could repartition, but that adds overhead), and I may later work with datasets of tens of GB, still small but too large for RAM.
You can simply provide the keyword argument row_group_size
when using pyarrow. Note that pyarrow is the default engine.
df.to_parquet("filename.parquet", row_group_size=500, engine="pyarrow")
An alternative answer for folks using fastparquet instead of pyarrow: fastparquet provides similar functionality via a differently named parameter, row_group_offsets.
df.to_parquet("filename.parquet", row_group_offsets=500, engine='fastparquet')
From the fastparquet documentation on row_group_offsets (int or list of int):

If int, row-groups will be approximately this many rows, rounded down to make row groups about the same size; if a list, the explicit index values to start new row groups; if None, set to 50_000_000. In case of partitioning the data, final row-group size can be reduced significantly further by the partitioning, occurring as a subsequent step.
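So, per that description, you can also pass a list to start row groups at explicit row indices instead of a roughly uniform size (a small sketch with placeholder data and file name):

import pandas as pd

# placeholder data; substitute your own DataFrame
df = pd.DataFrame({"x": range(2000)})

# start new row groups at rows 0, 500, 1000 and 1500
df.to_parquet("example.parquet",
              row_group_offsets=[0, 500, 1000, 1500],
              engine="fastparquet")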