Two basic questions that trouble me:
Background: I have a Hive cluster of 32 machines, and my table is defined with:
"CLUSTERED BY (MY_KEY) INTO 32 BUCKETS"
and hive.enforce.bucketing = true;
Thanks!
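For context, a setup like the one described in the question might look like the following sketch (the table and column names other than MY_KEY are hypothetical):

```sql
-- Enforce bucketing so INSERTs actually produce the declared bucket count
SET hive.enforce.bucketing = true;

-- Hypothetical table, bucketed to match the 32-machine cluster
CREATE TABLE my_table (
  MY_KEY  BIGINT,
  payload STRING
)
CLUSTERED BY (MY_KEY) INTO 32 BUCKETS;
```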
Hive allows users to read, write, and manage petabytes of data using SQL. Hive is built on top of Apache Hadoop, an open-source framework for efficiently storing and processing large datasets. As a result, Hive is closely integrated with Hadoop and is designed to work at petabyte scale.
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
Hive internally uses the MapReduce framework as its de facto engine for executing queries. MapReduce is a software framework for writing applications that process massive amounts of data in parallel on large clusters of commodity hardware.
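As a quick illustration of this (table name hypothetical), you can inspect the MapReduce plan Hive compiles a query into with EXPLAIN:

```sql
-- EXPLAIN prints the stage plan Hive generates for the query,
-- including the map and reduce stages of the underlying MapReduce job
EXPLAIN
SELECT MY_KEY, COUNT(*)
FROM my_table
GROUP BY MY_KEY;
```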
Without joins, the usual Hadoop MapReduce mechanism for data locality is used (it is described in Spike's answer).
Specifically for Hive, I would mention map joins. You can tell Hive the maximum size a table may have to qualify for a map-only join. When one of the tables is small enough, Hive replicates it to all nodes using the distributed cache mechanism and ensures that the entire join happens locally to the data.
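A sketch of the relevant settings and the older explicit hint syntax (table names are hypothetical; the 25 MB threshold shown is Hive's documented default):

```sql
-- Let Hive automatically convert eligible joins into map-only joins
SET hive.auto.convert.join = true;

-- Tables below this size (in bytes) count as "small" enough to be
-- replicated to every node via the distributed cache
SET hive.mapjoin.smalltable.filesize = 25000000;

-- Older explicit syntax: a hint naming the small table to replicate
SELECT /*+ MAPJOIN(d) */ f.MY_KEY, d.name
FROM fact_table f
JOIN dim_table d ON f.MY_KEY = d.MY_KEY;
```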
There is a good explanation of the process here:
http://www.facebook.com/note.php?note_id=470667928919