How does Hive/Hadoop ensure that each mapper works on data that is local to it?

Tags: hadoop, hive, hdfs

Two basic questions trouble me:

  • How can I be sure that each of the 32 files Hive uses to store my tables sits on its own machine?
  • If that happens, how can I be sure that when Hive creates 32 mappers, each of them will work on its local data? Does Hadoop/HDFS guarantee this magic, or does Hive, as a smart application, make sure that it happens?

Background: I have a Hive cluster of 32 machines, and:

  • All my tables are created with "CLUSTERED BY(MY_KEY) INTO 32 BUCKETS" (a DDL sketch follows this list)
  • I set hive.enforce.bucketing = true;
  • I verified that every table is indeed stored as 32 files under /user/hive/warehouse
  • I'm using an HDFS replication factor of 2
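
For concreteness, here is a minimal sketch of that setup; the table, column, and staging-table names are made-up placeholders:

    -- Make INSERTs produce exactly one file per bucket.
    SET hive.enforce.bucketing = true;

    -- Rows are hashed on my_key into 32 files under
    -- /user/hive/warehouse/my_table (one file per bucket).
    CREATE TABLE my_table (
      my_key BIGINT,
      payload STRING
    )
    CLUSTERED BY (my_key) INTO 32 BUCKETS;

    -- Bucketing only happens on INSERT ... SELECT; LOAD DATA would
    -- copy the files in unbucketed.
    INSERT OVERWRITE TABLE my_table
    SELECT my_key, payload FROM staging_table;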

Thanks!

asked Aug 04 '11 by ihadanny



2 Answers

  1. Data placement is determined by HDFS, which tries to balance bytes across machines. With replication factor 2, each file lives on two machines, which gives you two candidate machines for reading the data locally (you can inspect this from the Hive CLI, as sketched after this list).
  2. HDFS knows where each file is stored, and Hadoop uses this information to place mappers on the same hosts as the data. You can look at the counters for your job to see the "data local" and "rack local" map task counts. This is a feature of Hadoop that you don't need to worry about.
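
As a quick sanity check, the Hive CLI can run HDFS commands directly via its dfs keyword. A minimal sketch, assuming the default warehouse location and the hypothetical table name from the question above:

    -- List the 32 bucket files; the second column of the ls output is the
    -- replication factor, so each file should show "2" here.
    dfs -ls /user/hive/warehouse/my_table;

    -- Print only the replication factor of one bucket file
    -- (the 000000_0 file name is illustrative).
    dfs -stat %r /user/hive/warehouse/my_table/000000_0;

The "data local" and "rack local" map task counts themselves show up in the job's counters, e.g. on the JobTracker web UI for the MapReduce job that Hive generates.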
answered by Spike Gronim


Without joins, the usual Hadoop MapReduce mechanism for data locality is used (it is described in Spike's answer).
Specifically for Hive, I would mention map joins. You can tell Hive the maximum size of a table for a map-only join: when one of the tables is small enough, Hive replicates it to all nodes using the distributed cache mechanism and ensures that the whole join happens local to the data (a configuration sketch follows). There is a good explanation of the process here: http://www.facebook.com/note.php?note_id=470667928919
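
A minimal configuration sketch, using the classic Hive map-join properties and hint; the big_facts/small_dim table names are hypothetical:

    -- Let Hive convert a join to a map join automatically when one
    -- side is below this size threshold (in bytes).
    SET hive.auto.convert.join = true;
    SET hive.mapjoin.smalltable.filesize = 25000000;

    -- Or force it with the MAPJOIN hint: small_dim is shipped to every
    -- node via the distributed cache, and the join runs map-side
    -- against the local blocks of big_facts.
    SELECT /*+ MAPJOIN(d) */ f.my_key, d.label
    FROM big_facts f
    JOIN small_dim d ON f.my_key = d.my_key;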

answered by David Gruzman