Hive - Efficient join of two tables

I am joining two large tables in Hive (one is over 1 billion rows, one is about 100 million rows) like so:

create table joinedTable as select t1.id, ... from t1 join t2 ON (t1.id = t2.id);

I have bucketed the two tables in the same way, clustering by id into 100 buckets for each, but the query is still taking a long time.

Any suggestions on how to speed this up?

933

asked Nov 25 '13 17:11

maia

1 Answer

Since you bucketed the data by the join key, you can use a bucket map join. This requires that the number of buckets in one table be a multiple of the number of buckets in the other. Enable it by running set hive.optimize.bucketmapjoin=true; before the query. If the tables don't meet these conditions, Hive simply falls back to a regular join.
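
As a rough sketch of what that looks like for the query in the question: on the Hive versions current when this was asked, the bucket map join was typically requested with a MAPJOIN hint naming the smaller table. The select list here is cut down to t1.id because the original column list is elided, and whether the hint is still needed depends on your Hive version, so treat this as an assumption to verify rather than a recipe.

```sql
-- Allow Hive to join matching buckets on the map side.
SET hive.optimize.bucketmapjoin=true;

-- t2 (about 100 million rows) is the smaller table, so it is the one hinted.
-- Each mapper only loads the t2 bucket(s) that match its t1 bucket,
-- but those buckets still have to fit in the mapper's memory.
SELECT /*+ MAPJOIN(t2) */ t1.id
FROM t1
JOIN t2 ON (t1.id = t2.id);
```

With 100 buckets on a 100-million-row table, each t2 bucket holds roughly a million rows, so check that a bucket of that size fits in memory before relying on this.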

If both tables have the same number of buckets and the data within each bucket is sorted by the bucket key, Hive can perform the faster sort-merge bucket join. To enable it, run the following commands:

set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
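
Note that the sort-merge variant only applies if the tables were declared as sorted and actually written that way. Here is a minimal sketch of what that could look like; the table name t1_sorted and the payload column are hypothetical (only id appears in the question), and the two hive.enforce.* settings apply to Hive releases before 2.0, where bucketing and sorting were not enforced on insert by default.

```sql
-- Hypothetical bucketed and sorted copy of t1; only `id` comes from the question.
CREATE TABLE t1_sorted (id BIGINT, payload STRING)
CLUSTERED BY (id) SORTED BY (id ASC) INTO 100 BUCKETS;

-- Hive < 2.0 only: make the insert actually honor the bucket and sort spec.
SET hive.enforce.bucketing=true;
SET hive.enforce.sorting=true;

-- Rewrite the data so every bucket file is sorted by id.
INSERT OVERWRITE TABLE t1_sorted
SELECT id, payload FROM t1;
```

The same would have to be done for t2 with the same bucket count before the three settings above can take effect for the join.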

Visualizations of the different join techniques are available in this presentation: https://cwiki.apache.org/confluence/download/attachments/27362054/Hive+Summit+2011-join.pdf

111

answered Dec 04 '22 22:12

Adrian Lange

Tags:

join

optimization

hive

buckets