I'm starting to play around with Hadoop (but I don't have access to a cluster yet, so I'm just experimenting in standalone mode). My question is: once it's in a cluster setup, how are tasks distributed, and can the code base be transferred to new nodes?
Ideally, I would like to run large batch jobs and, if I need more capacity, add new nodes to the cluster, but I'm not sure whether I'll have to copy the same code that's running locally or do something special so that I can add capacity while the batch job is running. I thought I could store my codebase on HDFS and have it pulled locally every time I need to run it, but that still means I need some kind of initial script on the server and have to run it manually first.
Any suggestions or advice on whether this is possible would be great!
Thank you.
When you schedule a MapReduce job using the hadoop jar command, the JobTracker will determine how many mappers are needed to execute your job. This is usually determined by the number of blocks in the input file, and that number is fixed no matter how many worker nodes you have. The JobTracker then enlists one or more TaskTrackers to execute your job.
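To make that concrete, here is a minimal, hypothetical driver sketch (class names like MyJobDriver/MyMapper/MyReducer and the input/output paths are placeholders, and it assumes the newer Hadoop 2.x-style org.apache.hadoop.mapreduce client API). The point is that the job only declares its input; the number of map tasks is derived from how that input is split, not from the number of nodes:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MyJobDriver {

        // Trivial placeholder mapper: emits each input line with a count of 1.
        public static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                context.write(value, ONE);
            }
        }

        // Trivial placeholder reducer: sums the counts per key.
        public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "my-batch-job");

            // The jar containing this class is what Hadoop ships to the worker nodes.
            job.setJarByClass(MyJobDriver.class);
            job.setMapperClass(MyMapper.class);
            job.setReducerClass(MyReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // The number of map tasks comes from how this input is split into
            // blocks, not from how many worker nodes the cluster happens to have.
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }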
The application jar (along with any other jars that are specified using the -libjars argument) is copied automatically to all of the machines running the TaskTrackers that are used to execute your job. All of that is handled by the Hadoop infrastructure.
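For example, a submission might look something like the following (the jar names and HDFS paths are made up for illustration). One caveat: -libjars is a generic option handled by Hadoop's GenericOptionsParser, so the driver needs to run through ToolRunner (or otherwise parse generic options) for it to take effect:

    hadoop jar my-batch-job.jar MyJobDriver -libjars extra-lib.jar /user/me/input /user/me/output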
Adding additional tasktrackers will increase the parallelism of your job assuming that there are as-yet-unscheduled map tasks. What it will not do is automatically re-partition the input to parallelize across additional map capacity. So if you have a map capacity of 24 (assuming 6 mappers on each of 4 data nodes), and you have 100 map tasks with the first 24 executing, and you add another data node, you'll get some additional speed. If you have only 12 map tasks, adding machines won't help you.
Finally, you need to be aware of data locality. Since the data should ideally be processed on the same machines that initially store it, adding new TaskTrackers will not necessarily add proportional processing speed, because the data will not be local to those nodes at first and will need to be copied over the network.
I do not quite agree with Daniel's reply, primarily because if "on starting a job, jar code will be copied to all the nodes that the cluster knows of" were true, then even with 100 mappers and 1,000 nodes, the code for every job would always be copied to all of the nodes. That does not make sense.
Instead, Chris Shain's reply makes more sense: whenever the JobScheduler on the JobTracker chooses a job to be executed and identifies a task to be run by a particular data node, at that point it tells the TaskTracker where to copy the codebase from.
Initially (before the MapReduce job starts), the codebase is copied to multiple locations, as defined by the mapred.submit.replication parameter. Hence, a TaskTracker can copy the codebase from any of several locations, a list of which may be sent to it by the JobTracker.
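As a small illustrative sketch of that knob (the property name here is the MRv1 one mentioned above; newer releases use the equivalent mapreduce.client.submit.file.replication, and 10 is only an example value), the hypothetical driver shown earlier could set it on its Configuration before building the Job:

    // In MyJobDriver.main(), before calling Job.getInstance(conf, ...):
    conf.setInt("mapred.submit.replication", 10);  // how widely the job jar and other job files are replicated on HDFS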