
Generating parquet files - differences between R and Python

We have generated a Parquet file with Dask (Python) and with Drill (R, using the sergeant package). We have noticed a few issues:

  1. Dask (i.e. fastparquet) writes a _metadata and a _common_metadata file, while the Parquet output from R / Drill does not have these files and has parquet.crc files instead (which can be deleted). What is the difference between these Parquet implementations?
skibee asked May 16 '26 13:05



1 Answer

(Answering only point 1; please post separate questions to make them easier to answer.)

_metadata and _common_metadata are helper files that a Parquet dataset does not require. Spark, Dask, Hive, and similar tools use them to obtain the metadata of all Parquet files in a dataset without reading the footer of every file. In contrast, Apache Drill generates a similar file in each folder (on demand) that contains the footers of all Parquet files in that folder. Only the first query on a dataset reads all files; later queries read only the file that caches the footers.

Tools that use _metadata and _common_metadata should be able to leverage them for faster execution, but should not depend on them to operate. If the files do not exist, the query engine simply has to read all footers.
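
To make the fallback concrete, here is a minimal sketch of what "reading all footers" involves, using only the Python standard library. It relies on the Parquet file layout, in which an unencrypted file ends with a 4-byte little-endian footer length followed by the magic bytes PAR1. The function names footer_length and scan_footers are hypothetical, and a real engine would use a Parquet library to decode the footer itself rather than just locate it.

```python
import struct
from pathlib import Path

MAGIC = b"PAR1"
# Helper files that sit next to the data files but are not data themselves.
SIDECAR_NAMES = {"_metadata", "_common_metadata"}


def footer_length(path):
    """Return the size in bytes of the footer of one Parquet file."""
    with open(path, "rb") as f:
        size = f.seek(0, 2)  # jump to the end to learn the file size
        if size < 12:  # leading magic + footer length + trailing magic
            raise ValueError(f"{path} is too small to be a Parquet file")
        f.seek(size - 8)
        tail = f.read(8)  # 4-byte little-endian length, then the magic
    if tail[4:] != MAGIC:
        raise ValueError(f"{path} does not end with the Parquet magic bytes")
    return struct.unpack("<I", tail[:4])[0]


def scan_footers(dataset_dir):
    """Open every data file in a flat dataset folder and locate its footer.

    This is the per-file work that a _metadata file (or Drill's footer
    cache) lets an engine skip. Sidecar and .crc files are ignored.
    """
    lengths = {}
    for p in sorted(Path(dataset_dir).iterdir()):
        if not p.is_file():
            continue
        if p.name in SIDECAR_NAMES or p.name.endswith(".crc"):
            continue
        lengths[p.name] = footer_length(p)
    return lengths
```

Calling scan_footers on a dataset folder opens each data file once, so the cost grows with the number of files. That per-file cost is what the helper files are meant to avoid.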

Uwe L. Korn answered May 19 '26 03:05



