
Parquet predicate pushdown

Does Parquet's predicate pushdown mean that only the data that is required is actually loaded from disk?

E.g., if I create a Spark DataFrame and select only particular fields, will only those fields be read from disk?

jbrown asked Jan 28 '16 17:01


1 Answer

Predicate pushdown concerns which values are scanned, not which columns. If you filter on column A to return only records with value V, predicate pushdown lets Parquet read only the blocks that may contain V. Parquet stores min/max statistics at several levels (for example, per row group); it compares V against those statistics and scans only the blocks whose min/max range includes V. That is predicate pushdown.
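To make the min/max check concrete, here is a minimal sketch in plain Python. It is not Parquet's or Spark's actual code: `RowGroup` and `scan_equal` are hypothetical names, and each "row group" is just a list of values for column A with its statistics recorded when the group is created.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class RowGroup:
    """Hypothetical stand-in for one Parquet row group of column A."""
    values: List[int]
    min_value: int = field(init=False)
    max_value: int = field(init=False)

    def __post_init__(self) -> None:
        # Statistics are computed once, at "write" time, and kept as metadata.
        self.min_value = min(self.values)
        self.max_value = max(self.values)


def scan_equal(row_groups: List[RowGroup], v: int) -> Tuple[List[int], int]:
    """Return the values equal to v and how many row groups were scanned."""
    matches: List[int] = []
    scanned = 0
    for rg in row_groups:
        # Skip the group using metadata only: v cannot be inside it.
        if v < rg.min_value or v > rg.max_value:
            continue
        # The group *may* contain v, so its values have to be read.
        scanned += 1
        matches.extend(x for x in rg.values if x == v)
    return matches, scanned


groups = [RowGroup([1, 3, 5]), RowGroup([10, 12, 19]), RowGroup([4, 12, 30])]
print(scan_equal(groups, 12))  # two groups scanned, the first is skipped
print(scan_equal(groups, 7))   # one group scanned, although it holds no 7
```

Note the second call: the statistics can only rule a block out. A block whose range covers V still has to be scanned, even if it turns out to contain no matching value.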

Another Parquet feature is "projection pushdown": data is stored by column, so when your projection limits the query to certain columns, only those columns are read from disk. That is the behaviour the question describes, but it is not what is called predicate pushdown.
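A similarly rough sketch of projection pushdown, again in plain Python rather than real Parquet code. `ColumnarFile` is a hypothetical class that keeps each column as a separate chunk and records which chunks a read actually touches.

```python
from typing import Any, Dict, List


class ColumnarFile:
    """Hypothetical columnar file: every column is stored as its own chunk."""

    def __init__(self, columns: Dict[str, List[Any]]) -> None:
        self._columns = columns
        self.columns_read: List[str] = []  # log of chunks fetched from "disk"

    def read(self, projection: List[str]) -> List[Dict[str, Any]]:
        """Fetch only the projected columns and assemble them into rows."""
        chunks: Dict[str, List[Any]] = {}
        for name in projection:
            self.columns_read.append(name)
            chunks[name] = self._columns[name]
        n_rows = len(next(iter(chunks.values()))) if chunks else 0
        return [{name: chunks[name][i] for name in projection}
                for i in range(n_rows)]


f = ColumnarFile({
    "id": [1, 2],
    "name": ["a", "b"],
    "payload": ["x" * 1000, "y" * 1000],  # wide column the query never asks for
})
rows = f.read(["id", "name"])
print(rows)
print(f.columns_read)  # "payload" does not appear: it was never touched
```

In Spark you can check both effects on a real query by calling `explain()` on the DataFrame: the Parquet scan node of the physical plan lists the pruned columns under `ReadSchema` and the pushed-down conditions under `PushedFilters`.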

roee answered Oct 17 '22 09:10