Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Hive Sort Merge Bucket Join

Tags:

join

hive

Is Sort merge Bucket Join different from Sort Merge Bucket Map join? If so, what hints should be added to enable SMB join? How is SMBM join superior to SMB join?

Will "set hive.auto.convert.sortmerge.join=true" this hint alone be sufficient for SMB join? Else should the below hints be included as well.

set hive.optimize.bucketmapjoin = true set hive.optimize.bucketmapjoin.sortedmerge = true

The reason I ask is, the hint says Bucket map join, but MAP join is not performed here. I am under the assumption that both map and reduce tasks are involved in SMB while only map tasks are involved in SMBM.

Please correct me if I am wrong.

like image 222
Bagavathi Avatar asked Jan 05 '23 02:01

Bagavathi


1 Answers

If your table is large(determined by "set hive.mapjoin.smalltable.filesize;"), you cannot do a map side join. Except that your tables are bucketed and sorted, and you turned on "set hive.optimize.bucketmapjoin.sortedmerge = true", then you can still do a map side join on large tables. (Of course, you still need "set hive.optimize.bucketmapjoin = true")

Make sure that your tables are truly bucketed and sorted on the same column. It's so easy to make mistakes. To get a bucketed and sorted table, you need to

  1. set hive.enforce.bucketing=true;
  2. set hive.enforce.sorting=true;
  3. DDL script

    CREATE table XXX ( id int, name string ) CLUSTERED BY (id) SORTED BY (id) INTO XXX BUCKETS ; INSERT OVERWRITE TABLE XXX select * from XXX CLUSTER BY member_id ;

Use describe formatted XXX and look for Num Buckets, Bucket Columns, Sort Columns to make sure it's correctly setup.

Other requirements for the bucket join is that two tables should have

  1. Data bucketed on the same columns, and they are used in the ON clause.
  2. The number of buckets for one table must be a multiple of the number of buckets for the other table.

If you meet all the requirements, then the MAP join will be performed. And it will be lightning fast.

By the way, SMB Map Join is not well supported in Hive 1.X for ORC format. You will get a null exception. The bug has been fixed in 2.X.

like image 115
GC001 Avatar answered Jan 12 '23 03:01

GC001