Does Parquet's predicate pushdown mean that only the data that is required is actually loaded from disk?
E.g., if I create a Spark DataFrame and only select
particular fields, will only those fields be read from disk?
A predicate pushdown filters the data at the source of the query, reducing the number of entries retrieved from the database and improving query performance. By default, the Spark Dataset API will automatically push down valid WHERE clauses to the database.
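A minimal PySpark sketch of that behaviour, assuming a hypothetical PostgreSQL table reachable over JDBC (URL, credentials, table and column names are all placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("jdbc-pushdown-demo").getOrCreate()

# Hypothetical JDBC source; connection details are placeholders.
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://dbhost:5432/shop")
          .option("dbtable", "orders")
          .option("user", "reader")
          .option("password", "secret")
          .load())

# The filter is translated into SQL and executed by the database,
# so only matching rows travel over the wire. The scan node in the
# physical plan lists it under PushedFilters.
recent = orders.filter(F.col("order_date") >= "2023-01-01")
recent.explain()
```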
Parquet allows for predicate pushdown filtering, a form of query pushdown, because the file footer stores row-group-level metadata (including min/max statistics) for each column in the file.
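You can inspect that footer metadata directly. A short sketch with pyarrow (the file path is a placeholder):

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("/data/events.parquet")  # hypothetical file

# The footer holds per-row-group, per-column metadata, including the
# min/max statistics that predicate pushdown compares filters against.
for rg in range(pf.metadata.num_row_groups):
    for col_idx in range(pf.metadata.num_columns):
        col = pf.metadata.row_group(rg).column(col_idx)
        stats = col.statistics
        if stats is not None:
            print(rg, col.path_in_schema, stats.min, stats.max)
```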
Predicate pushdown is a data-processing technique that takes user-defined filters and executes them while reading the data. Apache Spark already supported it for Apache Parquet and RDBMS sources. Starting from Apache Spark 3.1.1, you can also use it for the Apache Avro, JSON and CSV formats!
Predicate pushdown deals with what values will be scanned, not what columns. So, if you apply a filter on column A to only return records with value V, the predicate pushdown will make Parquet read only blocks that may contain value V. Parquet holds min/max statistics at several levels, and it will compare the value V to those min/max headers and only scan blocks whose min/max range can contain V. That is predicate pushdown.
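A minimal PySpark sketch of this, assuming a hypothetical Parquet dataset at /data/events.parquet with a string column A:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("parquet-pushdown-demo").getOrCreate()

# Hypothetical Parquet dataset with a string column A (path is a placeholder).
df = spark.read.parquet("/data/events.parquet")

# The filter is pushed into the Parquet reader: row groups whose
# min/max statistics for A cannot contain "V" are skipped entirely.
filtered = df.filter(F.col("A") == "V")

# The physical plan shows the pushed predicate, e.g.
#   PushedFilters: [IsNotNull(A), EqualTo(A,V)]
filtered.explain()
```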
Another thing with Parquet is "projection pushdown": it stores data in columns, so when your projection limits the query to certain columns, only those columns will be read from disk. This feature is not what is called predicate pushdown, though.
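A matching sketch of projection pushdown on the same hypothetical dataset, where the pruned read schema is visible in the plan:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("projection-demo").getOrCreate()

# Same hypothetical Parquet dataset as above.
df = spark.read.parquet("/data/events.parquet")

# Column pruning: only column A's chunks are decoded from the file;
# the other columns are never read.
projected = df.select("A")

# The scan node's ReadSchema reflects the pruned schema, e.g.
#   ReadSchema: struct<A:string>
projected.explain()
```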