Is there a performance difference for these two Hive queries joining two tables and filtering on a partition key?

Question

Suppose both table A and B have ds as their partition key.

Method 1

SELECT * 
FROM A JOIN B ON A.userid=B.userid 
WHERE A.ds='2014-01-01' AND B.ds='2014-01-01'

Method 2

SELECT * 
FROM (
SELECT * FROM A WHERE A.ds='2014-01-01'
) JOIN (
SELECT * FROM B WHERE B.ds='2014-01-01'
) ON
A.userid=B.userid

Will the 2nd query be faster?

I'm wondering how WHERE and JOIN works in Hive. Is a where clause applied to the source table before the join when possible (as long as the clause contains only one table alias, like the ones above), or is it always applied only after the tables are joined (e.g. A.userid > B.userid has to be applied after the join)?

dimamah · Accepted Answer

Your question is actually about predicate pushdown in hive.
Well in the case above the execution will be exactly the same as hive will push the predicate A.ds='2014-01-01' AND B.ds='2014-01-01' to the mappers before the join.

In a more general case, the JOIN (inner join) is actually pretty easy and can be summed up to :
If it can push, it will push.
It can push the predicate when there is only one table involved (where a.x > 1) and can't push when there is more than 1 table involved (A.userid > B.userid) as the mapper reads a split of one of the tables only..

The more complex part is OUTER JOIN and furtunelty explained very clearly here.

P.S.
predicate pushdown is controlled by hive.optimize.ppd which is true by default.

Is there a performance difference for these two Hive queries joining two tables and filtering on a partition key?

Tags:

hql

hive

Xiangpeng Zhao

1 Answers

dimamah

Recent Activity

Donate For Us

Is there a performance difference for these two Hive queries joining two tables and filtering on a partition key?

Tags:

hql

hive

Xiangpeng Zhao

1 Answers

dimamah

Related questions

Recent Activity

Donate For Us