I am trying to store my data in a nested way in Parquet, using a map-type column to store complex objects as values.
Could somebody let me know whether filter pushdown works on map-type columns or not? For example, below is my SQL query:
`select measureMap['CR01'].tenorMap['1M'] from RiskFactor where businessDate='2016-03-14' and bookId='FI-UK'`
measureMap is a map with a String key and a custom data type as the value, containing two attributes: a String and another map of String/Double pairs.
I want to know whether pushdown will work on the map or not, i.e. if the map has 10 key-value pairs, will Spark bring the whole map's data into memory and create the object model, or will it filter out the data by key at the I/O read level?
I also want to know whether there is any way to specify the key in the WHERE clause, something like `where measureMap.key = 'CR01'`?
Predicate pushdown filters the data at the source of the query, reducing the number of entries retrieved and improving query performance. By default, the Spark Dataset API automatically pushes down valid WHERE clauses to the data source.
Parquet supports predicate pushdown filtering, a form of query pushdown, because the file footer stores row-group-level metadata (such as min/max statistics) for each column in the file.
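To make the mechanism concrete, here is a minimal plain-Python sketch (not real Parquet or Spark code; `RowGroup` and `can_skip` are illustrative names) of how per-row-group min/max statistics let a reader skip whole row groups without touching their data pages:

```python
# Conceptual sketch: each Parquet row group stores per-column min/max
# statistics in the file footer; a reader can skip an entire row group
# when the predicate cannot possibly match any row inside it.

from dataclasses import dataclass

@dataclass
class RowGroup:
    min_business_date: str
    max_business_date: str

def can_skip(rg: RowGroup, wanted_date: str) -> bool:
    # If the wanted date lies outside [min, max], no row in this group matches.
    return wanted_date < rg.min_business_date or wanted_date > rg.max_business_date

groups = [
    RowGroup("2016-01-01", "2016-02-29"),
    RowGroup("2016-03-01", "2016-03-31"),
]

# Only the second row group can contain businessDate = '2016-03-14',
# so the first is skipped without reading its data pages.
to_read = [i for i, rg in enumerate(groups) if not can_skip(rg, "2016-03-14")]
print(to_read)  # [1]
```

This only works because the statistics exist per column; that is exactly what a map column cannot provide, as explained below.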
Spark SQL can cache tables using an in-memory columnar format by calling `spark.catalog.cacheTable("tableName")` or `dataFrame.cache()`.
PushDownPredicate is a base logical optimization that pushes Filter operators down a logical query plan, closer to the data source. PushDownPredicate is part of the Operator Optimization before Inferring Filters fixed-point batch in the standard batches of the Catalyst Optimizer.
The short answer is no: Parquet predicate pushdown does not work on MapType columns or on nested Parquet structures.
Spark's Catalyst optimizer only understands top-level columns in the Parquet data. It uses the column type, the column's data range, the encoding, etc. to generate the whole-stage code for the query.
When the data is in a MapType column, this information cannot be obtained from the column. A map may hold hundreds of key-value pairs, which makes a predicate pushdown impossible with the current Spark infrastructure. You can still reference a map key in a WHERE clause, e.g. `where measureMap['CR01'] IS NOT NULL`, but that filter is evaluated after the whole map has been read into memory, not at the I/O level.
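A common workaround is to promote the map keys you filter on most often to top-level columns before writing to Parquet, so the writer can record per-column statistics for them. The sketch below shows the idea with plain Python dicts (illustrative only; in Spark this would be something like `df.withColumn("cr01", col("measureMap")["CR01"])` before the write):

```python
# Workaround sketch: copy frequently filtered map entries into top-level
# fields. Once they are ordinary columns in the Parquet file, pushdown
# and row-group skipping work on them like on any other column.

records = [
    {"bookId": "FI-UK", "measureMap": {"CR01": 1.25, "IR01": 0.40}},
    {"bookId": "FI-DE", "measureMap": {"IR01": 2.10}},
]

def flatten(record, keys):
    # Keep the identifying fields and lift the selected map entries.
    flat = {"bookId": record["bookId"]}
    for k in keys:
        flat[k] = record["measureMap"].get(k)
    return flat

flat_records = [flatten(r, ["CR01"]) for r in records]
print(flat_records[0])  # {'bookId': 'FI-UK', 'CR01': 1.25}
print(flat_records[1])  # {'bookId': 'FI-DE', 'CR01': None}
```

The trade-off is schema rigidity: each promoted key becomes a fixed column, so this only suits keys that are known up front.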