Say I have a pandas DataFrame df that I would like to store on disk as a dataset using pyarrow parquet. I would do this:
import pyarrow
import pyarrow.parquet

table = pyarrow.Table.from_pandas(df)
pyarrow.parquet.write_to_dataset(table, root_path=some_path, partition_cols=['a'])
On disk the dataset would look something like this:
some_path
├── a=1
│   └── 4498704937d84fe5abebb3f06515ab2d.parquet
└── a=2
    └── 8bcfaed8986c4bdba587aaaee532370c.parquet
Q: Is it possible to somehow override the automatic assignment of the long UUID as the filename when writing the dataset? My purpose is to be able to overwrite the dataset on disk when I have a new version of df. Currently, if I write the dataset again, another uniquely named [UUID].parquet file is placed next to the old one, containing the same, redundant data. A minimal sketch of this behaviour is shown below.
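For reference, here is a minimal, self-contained sketch of that behaviour (the sample DataFrame and the my_dataset path are made up for illustration): calling write_to_dataset twice leaves two identically named-by-UUID parquet files with the same rows in each partition directory.
import os

import pandas as pd
import pyarrow
import pyarrow.parquet

# Made-up sample data; column 'a' is the partition column.
df = pd.DataFrame({'a': [1, 1, 2], 'b': [0.1, 0.2, 0.3]})
table = pyarrow.Table.from_pandas(df)

some_path = 'my_dataset'  # illustrative output directory
pyarrow.parquet.write_to_dataset(table, root_path=some_path, partition_cols=['a'])
pyarrow.parquet.write_to_dataset(table, root_path=some_path, partition_cols=['a'])

# Each partition directory now holds two [UUID].parquet files with the same data.
print(os.listdir(os.path.join(some_path, 'a=1')))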
For anyone who is also interested in the development of this issue: it has been solved as of pyarrow version 0.15.0, with great thanks to the open source community (Jira issue link).
Following the example used in the question:
pyarrow.parquet.write_to_dataset(table,
                                 root_path=some_path,
                                 partition_cols=['a'],
                                 partition_filename_cb=lambda keys: '-'.join(str(k) for k in keys) + '.parquet')
would produce a saved dataset like this:
some_path
├── a=1
│   └── 1.parquet
└── a=2
    └── 2.parquet
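Putting it together, here is a hedged end-to-end sketch (assuming a pyarrow version that provides partition_filename_cb, i.e. 0.15.0 or later, and an illustrative my_dataset path): because the filenames are now deterministic, writing a new version of df replaces the existing file in each partition instead of adding another UUID-named one.
import pandas as pd
import pyarrow
import pyarrow.parquet

def partition_filename(keys):
    # keys is the tuple of partition values for one group, e.g. (1,) for a=1
    return '-'.join(str(k) for k in keys) + '.parquet'

df_v1 = pd.DataFrame({'a': [1, 2], 'b': ['old', 'old']})
df_v2 = pd.DataFrame({'a': [1, 2], 'b': ['new', 'new']})

for df in (df_v1, df_v2):
    table = pyarrow.Table.from_pandas(df)
    pyarrow.parquet.write_to_dataset(table,
                                     root_path='my_dataset',
                                     partition_cols=['a'],
                                     partition_filename_cb=partition_filename)

# my_dataset/a=1/1.parquet and my_dataset/a=2/2.parquet now contain the df_v2 rows.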