Two basic questions that trouble me:
Background: I have a Hive cluster of 32 machines, and my table is defined with:
"CLUSTERED BY (MY_KEY) INTO 32 BUCKETS"
and hive.enforce.bucketing = true;
Thanks!
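For context, a setup like the one described in the question might look like the following sketch (the table and column names other than MY_KEY are hypothetical):

```sql
-- Enforce bucketing so INSERTs actually produce the declared bucket count
SET hive.enforce.bucketing = true;

-- Hypothetical table, bucketed to match the 32-machine cluster
CREATE TABLE my_table (
  MY_KEY  BIGINT,
  payload STRING
)
CLUSTERED BY (MY_KEY) INTO 32 BUCKETS;
```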
Hive allows users to read, write, and manage petabytes of data using SQL. Hive is built on top of Apache Hadoop, an open-source framework for efficiently storing and processing large datasets. As a result, Hive is closely integrated with Hadoop and is designed to work at petabyte scale.
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
Hive internally uses the MapReduce framework as its de facto engine for executing queries. MapReduce is a software framework for writing applications that process massive amounts of data in parallel on large clusters of commodity hardware.
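As a quick illustration of this (table name hypothetical), you can inspect the MapReduce plan Hive compiles a query into with EXPLAIN:

```sql
-- EXPLAIN prints the stage plan Hive generates for the query,
-- including the map and reduce stages of the underlying MapReduce job
EXPLAIN
SELECT MY_KEY, COUNT(*)
FROM my_table
GROUP BY MY_KEY;
```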
Without joins, the usual Hadoop MapReduce mechanism for data locality is used (it is described in Spike's answer).
Specifically for Hive, I would mention map joins. You can tell Hive the maximum size a table may have to qualify for a map-only join. When one of the tables is small enough, Hive replicates it to all nodes using the distributed cache mechanism and ensures that the entire join happens locally to the data.
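A sketch of the relevant settings and the older explicit hint syntax (table names are hypothetical; the 25 MB threshold shown is Hive's documented default):

```sql
-- Let Hive automatically convert eligible joins into map-only joins
SET hive.auto.convert.join = true;

-- Tables below this size (in bytes) count as "small" enough to be
-- replicated to every node via the distributed cache
SET hive.mapjoin.smalltable.filesize = 25000000;

-- Older explicit syntax: a hint naming the small table to replicate
SELECT /*+ MAPJOIN(d) */ f.MY_KEY, d.name
FROM fact_table f
JOIN dim_table d ON f.MY_KEY = d.MY_KEY;
```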
There is a good explanation of the process here:
http://www.facebook.com/note.php?note_id=470667928919