filter pushdown using spark-sql on map type column in parquet

I am trying to store my data in a nested way in Parquet, using a map type column to store complex objects as values.

Could somebody let me know whether filter pushdown works on map type columns or not? For example, below is my SQL query:

`select measureMap['CR01'].tenorMap['1M'] from RiskFactor where businessDate='2016-03-14' and bookId='FI-UK'`

measureMap is a map with a String key and a custom data type as value, containing 2 attributes - a String and another map of String-to-Double pairs.

I want to know whether pushdown will work on the map or not, i.e. if the map has 10 key-value pairs, will Spark bring the whole map's data into memory and create the object model, or will it filter out the data depending on the key at the I/O read level?

Also, I want to know whether there is any way to specify the key in the where clause, something like `where measureMap.key = 'CR01'`?
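
For reference, here is a minimal Scala sketch of the schema described above. The field name `label` is a hypothetical stand-in for the String attribute; only `measureMap`, `tenorMap`, `businessDate` and `bookId` come from the question.

```scala
import org.apache.spark.sql.types._

// Nested map: tenor -> value (String -> Double)
val tenorMapType = MapType(StringType, DoubleType)

// Custom value type: one String attribute plus the nested map.
// The name "label" is illustrative, not from the original data model.
val measureValueType = StructType(Seq(
  StructField("label", StringType),
  StructField("tenorMap", tenorMapType)
))

val riskFactorSchema = StructType(Seq(
  StructField("businessDate", StringType),
  StructField("bookId", StringType),
  StructField("measureMap", MapType(StringType, measureValueType))
))
```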

Vijayendra Bhati · asked Jun 21 '16 08:06


People also ask

What is filter pushdown in Spark?

A predicate push down filters the data in the database query, reducing the number of entries retrieved from the database and improving query performance. By default the Spark Dataset API will automatically push down valid WHERE clauses to the database.
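As a sketch of this behavior, the snippet below reads from a hypothetical JDBC source (the URL, table name, and `local[*]` setup are placeholders; a JDBC driver must be on the classpath) and applies a filter that Spark translates into a WHERE clause executed by the database:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("pushdown-demo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Placeholder connection details; any JDBC-compatible database works.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/riskdb") // hypothetical
  .option("dbtable", "risk_factor")                         // hypothetical
  .load()

// This filter is pushed down and executed by the database, not by Spark.
val filtered = df.filter($"businessDate" === "2016-03-14")

// The physical plan lists the pushed predicate under PushedFilters.
filtered.explain()
```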

Does parquet support predicate pushdown?

Parquet allows for predicate pushdown filtering, a form of query pushdown because the file footer stores row-group level metadata for each column in the file.
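A small sketch of this with a top-level column, assuming the spark-shell where `spark` and its implicits are available (the path `/tmp/riskfactor` is arbitrary):

```scala
import spark.implicits._

// Write a tiny parquet file with top-level columns, then filter on one.
Seq(("2016-03-14", "FI-UK", 1.0), ("2016-03-15", "FI-US", 2.0))
  .toDF("businessDate", "bookId", "value")
  .write.mode("overwrite").parquet("/tmp/riskfactor")

val pushed = spark.read.parquet("/tmp/riskfactor")
  .filter($"businessDate" === "2016-03-14")

// explain() shows the predicate under PushedFilters, e.g.
// [IsNotNull(businessDate), EqualTo(businessDate,2016-03-14)].
// The reader can then skip row groups using the footer's min/max stats.
pushed.explain()
```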

Which option can be used in Spark SQL if you need to use an in memory columnar structure to cache tables?

Spark SQL can cache tables using an in-memory columnar format by calling `spark.catalog.cacheTable("tableName")` or `dataFrame.cache()`.
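A minimal usage sketch, assuming an existing DataFrame `df` and the spark-shell's `spark` session:

```scala
// Cache by table name via the catalog...
df.createOrReplaceTempView("riskfactor")
spark.catalog.cacheTable("riskfactor")

// ...or directly on the DataFrame handle.
df.cache()
```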

What is PushDownPredicate?

PushDownPredicate is a base logical optimization that pushes Filter operators down a logical query plan, closer to the data source. PushDownPredicate is part of the Operator Optimization before Inferring Filters fixed-point batch in the standard batches of the Catalyst Optimizer.
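To see the optimizer's work directly, you can inspect the plans on any Dataset (a sketch, again assuming the spark-shell and the `/tmp/riskfactor` file written above):

```scala
import spark.implicits._

val q = spark.read.parquet("/tmp/riskfactor")
  .filter($"bookId" === "FI-UK")

// The optimized logical plan shows the Filter placed next to the relation.
println(q.queryExecution.optimizedPlan)

// explain(true) prints the parsed, analyzed, optimized and physical plans.
q.explain(true)
```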


1 Answer

The short answer is no. Parquet predicate pushdown doesn't work on MapType columns or on fields nested inside a Parquet structure.
Spark's Catalyst optimizer only understands top-level columns in the Parquet data. It uses the column type, the column's data range, the encoding, etc. to finally generate the whole-stage code for the query.
When the data is in a MapType format, it is not possible to get this information from the column. A map could hold hundreds of key-value pairs, and with the current Spark infrastructure it is impossible to do a predicate pushdown over them.
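A possible workaround, which follows directly from the above: promote the map entries you filter on to top-level columns before writing, so Catalyst can push predicates on them. A sketch, assuming a DataFrame `df` with the question's schema and the hypothetical `label` field from earlier:

```scala
import spark.implicits._

// Pull a specific map entry out into an ordinary top-level column.
val flattened = df
  .withColumn("cr01Label", $"measureMap".getItem("CR01").getField("label"))
  .select("businessDate", "bookId", "cr01Label")

flattened.write.mode("overwrite").parquet("/tmp/riskfactor_flat")

// Predicates on the promoted column are now eligible for pushdown.
spark.read.parquet("/tmp/riskfactor_flat")
  .filter($"cr01Label" === "someValue")
  .explain()
```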

Avishek Bhattacharya · answered Sep 27 '22 23:09