Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Hive - Efficient join of two tables

I am joining two large tables in Hive (one is over 1 billion rows, one is about 100 million rows) like so:

create table joinedTable as select t1.id, ... from t1 join t2 ON (t1.id = t2.id);

I have bucketed the two tables in the same way, clustering by id into 100 buckets for each, but the query is still taking a long time.

Any suggestions on how to speed this up?

like image 933
maia Avatar asked Nov 25 '13 17:11

maia


People also ask

How do I join multiple tables in Hive?

Basically, to combine and retrieve the records from multiple tables we use Hive Join clause. Moreover, in SQL JOIN is as same as OUTER JOIN. Moreover, by using the primary keys and foreign keys of the tables JOIN condition is to be raised.

Can we join tables in Hive?

JOIN is a clause that is used for combining specific fields from two tables by using values common to each one. It is used to combine records from two or more tables in the database.

Is it possible to create Cartesian join between 2 tables using Hive?

In this recipe, you will learn how to use a cross join in Hive. Cross join, also known as Cartesian product, is a way of joining multiple tables in which all the rows or tuples from one table are paired with the rows and tuples from another table.


1 Answers

As you bucketed the data by the join keys, you could use the Bucket Map Join. For that the amount of buckets in one table must be a multiple of the amount of buckets in the other table. It can be activated by executing set hive.optimize.bucketmapjoin=true; before the query. If the tables don't meet the conditions, Hive will simply perform the normal Inner Join.

If both tables have the same amount of buckets and the data is sorted by the bucket keys, Hive can perform the faster Sort-Merge Join. To activate it, you have to execute the following commands:

set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;

You can find some visualizations of the different join techniques under https://cwiki.apache.org/confluence/download/attachments/27362054/Hive+Summit+2011-join.pdf.

like image 111
Adrian Lange Avatar answered Dec 04 '22 22:12

Adrian Lange