I am joining two large tables in Hive (one is over 1 billion rows, one is about 100 million rows) like so:
create table joinedTable as select t1.id, ... from t1 join t2 ON (t1.id = t2.id);
I have bucketed the two tables in the same way, clustering by id into 100 buckets for each, but the query is still taking a long time.
Any suggestions on how to speed this up?
Basically, to combine and retrieve the records from multiple tables we use Hive Join clause. Moreover, in SQL JOIN is as same as OUTER JOIN. Moreover, by using the primary keys and foreign keys of the tables JOIN condition is to be raised.
JOIN is a clause that is used for combining specific fields from two tables by using values common to each one. It is used to combine records from two or more tables in the database.
In this recipe, you will learn how to use a cross join in Hive. Cross join, also known as Cartesian product, is a way of joining multiple tables in which all the rows or tuples from one table are paired with the rows and tuples from another table.
As you bucketed the data by the join keys, you could use the Bucket Map Join. For that the amount of buckets in one table must be a multiple of the amount of buckets in the other table. It can be activated by executing set hive.optimize.bucketmapjoin=true;
before the query. If the tables don't meet the conditions, Hive will simply perform the normal Inner Join.
If both tables have the same amount of buckets and the data is sorted by the bucket keys, Hive can perform the faster Sort-Merge Join. To activate it, you have to execute the following commands:
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
You can find some visualizations of the different join techniques under https://cwiki.apache.org/confluence/download/attachments/27362054/Hive+Summit+2011-join.pdf.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With