Say I have a pandas DataFrame df that I would like to store on disk as a dataset using pyarrow parquet. I would do this:
import pyarrow
import pyarrow.parquet

table = pyarrow.Table.from_pandas(df)
pyarrow.parquet.write_to_dataset(table, root_path=some_path, partition_cols=['a'])
On disk the dataset would look something like this:
some_path
├── a=1
│   └── 4498704937d84fe5abebb3f06515ab2d.parquet
└── a=2
    └── 8bcfaed8986c4bdba587aaaee532370c.parquet
Q: Is it possible to somehow override the automatic assignment of the long UUID as the filename when writing the dataset? My purpose is to be able to overwrite the dataset on disk when I have a new version of df. Currently, if I write the dataset again, another uniquely named [UUID].parquet file is placed next to the old one, containing the same, redundant data. A minimal sketch of this behaviour is shown below.
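For reference, here is a minimal, self-contained sketch of that behaviour (the sample DataFrame and the my_dataset path are made up for illustration): calling write_to_dataset twice leaves two identically named-by-UUID parquet files with the same rows in each partition directory.
import os

import pandas as pd
import pyarrow
import pyarrow.parquet

# Made-up sample data; column 'a' is the partition column.
df = pd.DataFrame({'a': [1, 1, 2], 'b': [0.1, 0.2, 0.3]})
table = pyarrow.Table.from_pandas(df)

some_path = 'my_dataset'  # illustrative output directory
pyarrow.parquet.write_to_dataset(table, root_path=some_path, partition_cols=['a'])
pyarrow.parquet.write_to_dataset(table, root_path=some_path, partition_cols=['a'])

# Each partition directory now holds two [UUID].parquet files with the same data.
print(os.listdir(os.path.join(some_path, 'a=1')))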
For anyone who is also interested in the development of this issue: it has been solved as of pyarrow version 0.15.0, with great thanks to the open source community (Jira issue link).
Following the example used in the question:
pyarrow.parquet.write_to_dataset(table,
                                 root_path=some_path,
                                 partition_cols=['a'],
                                 partition_filename_cb=lambda keys: '-'.join(str(k) for k in keys) + '.parquet')
would produce a saved dataset like this:
some_path
├── a=1
│   └── 1.parquet
└── a=2
    └── 2.parquet
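Putting it together, here is a hedged end-to-end sketch (assuming a pyarrow version that provides partition_filename_cb, i.e. 0.15.0 or later, and an illustrative my_dataset path): because the filenames are now deterministic, writing a new version of df replaces the existing file in each partition instead of adding another UUID-named one.
import pandas as pd
import pyarrow
import pyarrow.parquet

def partition_filename(keys):
    # keys is the tuple of partition values for one group, e.g. (1,) for a=1
    return '-'.join(str(k) for k in keys) + '.parquet'

df_v1 = pd.DataFrame({'a': [1, 2], 'b': ['old', 'old']})
df_v2 = pd.DataFrame({'a': [1, 2], 'b': ['new', 'new']})

for df in (df_v1, df_v2):
    table = pyarrow.Table.from_pandas(df)
    pyarrow.parquet.write_to_dataset(table,
                                     root_path='my_dataset',
                                     partition_cols=['a'],
                                     partition_filename_cb=partition_filename)

# my_dataset/a=1/1.parquet and my_dataset/a=2/2.parquet now contain the df_v2 rows.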