Suppose both table A
and B
have ds
as their partition key.
Method 1
SELECT *
FROM A JOIN B ON A.userid=B.userid
WHERE A.ds='2014-01-01' AND B.ds='2014-01-01'
Method 2
SELECT *
FROM (
SELECT * FROM A WHERE A.ds='2014-01-01'
) JOIN (
SELECT * FROM B WHERE B.ds='2014-01-01'
) ON
A.userid=B.userid
Will the 2nd query be faster?
I'm wondering how WHERE
and JOIN
works in Hive. Is a where
clause applied to the source table before the join when possible (as long as the clause contains only one table alias, like the ones above), or is it always applied only after the tables are joined (e.g. A.userid > B.userid
has to be applied after the join)?
Your question is actually about predicate pushdown in hive.
Well in the case above the execution will be exactly the same as hive will push the predicate A.ds='2014-01-01' AND B.ds='2014-01-01'
to the mappers before the join.
In a more general case, the JOIN
(inner join) is actually pretty easy and can be summed up to :
If it can push, it will push.
It can push the predicate when there is only one table involved (where a.x > 1
) and can't push when there is more than 1 table involved (A.userid > B.userid
) as the mapper reads a split of one of the tables only..
The more complex part is OUTER JOIN
and furtunelty explained very clearly here.
P.S.
predicate pushdown is controlled by hive.optimize.ppd
which is true by default.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With