Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is there a performance difference for these two Hive queries joining two tables and filtering on a partition key?

Tags:

hql

hive

Suppose both table A and B have ds as their partition key.

Method 1

SELECT * 
FROM A JOIN B ON A.userid=B.userid 
WHERE A.ds='2014-01-01' AND B.ds='2014-01-01'

Method 2

SELECT * 
FROM (
SELECT * FROM A WHERE A.ds='2014-01-01'
) JOIN (
SELECT * FROM B WHERE B.ds='2014-01-01'
) ON
A.userid=B.userid 

Will the 2nd query be faster?

I'm wondering how WHERE and JOIN works in Hive. Is a where clause applied to the source table before the join when possible (as long as the clause contains only one table alias, like the ones above), or is it always applied only after the tables are joined (e.g. A.userid > B.userid has to be applied after the join)?

like image 709
Xiangpeng Zhao Avatar asked Jan 16 '14 00:01

Xiangpeng Zhao


1 Answers

Your question is actually about predicate pushdown in hive.
Well in the case above the execution will be exactly the same as hive will push the predicate A.ds='2014-01-01' AND B.ds='2014-01-01' to the mappers before the join.

In a more general case, the JOIN (inner join) is actually pretty easy and can be summed up to :
If it can push, it will push.
It can push the predicate when there is only one table involved (where a.x > 1) and can't push when there is more than 1 table involved (A.userid > B.userid) as the mapper reads a split of one of the tables only..

The more complex part is OUTER JOIN and furtunelty explained very clearly here.

P.S.
predicate pushdown is controlled by hive.optimize.ppd which is true by default.

like image 96
dimamah Avatar answered Sep 17 '22 15:09

dimamah