Generating parquet files - differences between R and Python

Question

We have generated a Parquet file with Dask (Python) and with Drill (from R, using the sergeant package). We have noticed a few issues:

The Dask (i.e. fastparquet) output has _metadata and _common_metadata files, while the Parquet output from R / Drill does not have these files and has parquet.crc files instead (which can be deleted). What is the difference between these Parquet implementations?

Uwe L. Korn · Accepted Answer

(Only answering point 1; please post separate questions to make them easier to answer.)

_metadata and _common_metadata are helper files that are not required for a Parquet dataset. They are used by Spark/Dask/Hive/... to infer the metadata of all Parquet files in a dataset without having to read the footer of every file. In contrast, Apache Drill generates a similar file in each folder (on demand) that contains the footers of all Parquet files. Only the first query on a dataset reads all files; subsequent queries read only the file that caches the footers.
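To make "reading the footer of every file" concrete, here is a minimal sketch using only the Python standard library. It assumes the standard unencrypted Parquet layout, in which a file ends with a 4-byte little-endian footer length followed by the magic bytes PAR1. The function names footer_length and total_footer_bytes are made up for this illustration; they are not part of fastparquet, Dask or Drill.

```python
import os
import struct
from pathlib import Path

PARQUET_MAGIC = b"PAR1"


def footer_length(path):
    """Return the size in bytes of the footer metadata of one Parquet file.

    Assumes the file ends with <footer><4-byte LE length><b"PAR1">.
    """
    # Smallest possible file: leading magic + length field + trailing magic.
    if os.path.getsize(path) < 12:
        raise ValueError(f"{path} is too small to be a Parquet file")
    with open(path, "rb") as f:
        f.seek(-8, os.SEEK_END)  # last 8 bytes: length field + magic
        tail = f.read(8)
    if tail[4:] != PARQUET_MAGIC:
        raise ValueError(f"{path} does not end with the Parquet magic bytes")
    (length,) = struct.unpack("<I", tail[:4])
    return length


def total_footer_bytes(dataset_dir):
    """Sum the footer sizes of every *.parquet file below dataset_dir.

    Hypothetical helper: it touches each file once, which is the work a
    summary file such as _metadata lets an engine skip.
    """
    return sum(footer_length(p) for p in Path(dataset_dir).rglob("*.parquet"))
```

Calling total_footer_bytes on a dataset directory opens every data file, which is why a single cached summary pays off on datasets with many files.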

Tools using _metadata and _common_metadata should be able to leverage them for faster execution, but should not depend on them to operate. If the files do not exist, the query engine simply needs to read all footers.
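The fallback described above can be sketched in a few lines of standard-library Python. This only illustrates the decision; it is not how Dask, Spark or Drill actually implement it, and the function name files_to_read_for_metadata is hypothetical.

```python
from pathlib import Path


def files_to_read_for_metadata(dataset_dir):
    """Return the files an engine would open to learn the dataset metadata.

    If a _metadata summary exists, it alone is enough. Otherwise every
    Parquet data file has to be opened so that its footer can be read.
    """
    root = Path(dataset_dir)
    summary = root / "_metadata"
    if summary.is_file():
        return [summary]
    # No summary file: fall back to the footers of all data files.
    return sorted(root.rglob("*.parquet"))
```

With this logic the dataset stays readable whether or not the helper files exist, so deleting them costs speed rather than correctness.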

Tags: r, parquet, dask, fastparquet, apache-drill

skibee
