Is Sort merge Bucket Join different from Sort Merge Bucket Map join? If so, what hints should be added to enable SMB join? How is SMBM join superior to SMB join?
Will "set hive.auto.convert.sortmerge.join=true" this hint alone be sufficient for SMB join? Else should the below hints be included as well.
set hive.optimize.bucketmapjoin = true set hive.optimize.bucketmapjoin.sortedmerge = true
The reason I ask is, the hint says Bucket map join, but MAP join is not performed here. I am under the assumption that both map and reduce tasks are involved in SMB while only map tasks are involved in SMBM.
Please correct me if I am wrong.
If your table is large(determined by "set hive.mapjoin.smalltable.filesize;"), you cannot do a map side join. Except that your tables are bucketed and sorted, and you turned on "set hive.optimize.bucketmapjoin.sortedmerge = true", then you can still do a map side join on large tables. (Of course, you still need "set hive.optimize.bucketmapjoin = true")
Make sure that your tables are truly bucketed and sorted on the same column. It's so easy to make mistakes. To get a bucketed and sorted table, you need to
DDL script
CREATE table XXX
(
id int,
name string
)
CLUSTERED BY (id)
SORTED BY (id)
INTO XXX BUCKETS
;
INSERT OVERWRITE TABLE XXX
select * from XXX
CLUSTER BY member_id
;
Use describe formatted XXX
and look for Num Buckets, Bucket Columns, Sort Columns
to make sure it's correctly setup.
Other requirements for the bucket join is that two tables should have
If you meet all the requirements, then the MAP join will be performed. And it will be lightning fast.
By the way, SMB Map Join is not well supported in Hive 1.X for ORC format. You will get a null exception. The bug has been fixed in 2.X.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With