I'm starting to play around with Hadoop (but I don't have access to a cluster yet, so I'm just experimenting in standalone mode). My question is: once it's in a cluster setup, how are tasks distributed, and can the code base be transferred to new nodes?
Ideally, I would like to run large batch jobs and, if I need more capacity, add new nodes to the cluster, but I'm not sure whether I'll have to copy the same code that's running locally or do something special so that I can add capacity while the batch job is running. I thought I could store my codebase on HDFS and have it pulled locally every time I need to run it, but that still means I need some kind of initial script on the server and have to run it manually first.
Any suggestions or advice on whether this is possible would be great!
Thank you.
When you schedule a MapReduce job using the hadoop jar command, the JobTracker will determine how many mappers are needed to execute your job. This is usually determined by the number of blocks in the input file, and that number is fixed no matter how many worker nodes you have. The JobTracker then enlists one or more TaskTrackers to execute your job.
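To make that concrete, here is a minimal, hypothetical driver sketch (class names like MyJobDriver/MyMapper/MyReducer and the input/output paths are placeholders, and it assumes the newer Hadoop 2.x-style org.apache.hadoop.mapreduce client API). The point is that the job only declares its input; the number of map tasks is derived from how that input is split, not from the number of nodes:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MyJobDriver {

        // Trivial placeholder mapper: emits each input line with a count of 1.
        public static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                context.write(value, ONE);
            }
        }

        // Trivial placeholder reducer: sums the counts per key.
        public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "my-batch-job");

            // The jar containing this class is what Hadoop ships to the worker nodes.
            job.setJarByClass(MyJobDriver.class);
            job.setMapperClass(MyMapper.class);
            job.setReducerClass(MyReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // The number of map tasks comes from how this input is split into
            // blocks, not from how many worker nodes the cluster happens to have.
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }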
The application jar (along with any other jars that are specified using the -libjars argument) is copied automatically to all of the machines running the TaskTrackers that are used to execute your job. All of that is handled by the Hadoop infrastructure.
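For example, a submission might look something like the following (the jar names and HDFS paths are made up for illustration). One caveat: -libjars is a generic option handled by Hadoop's GenericOptionsParser, so the driver needs to run through ToolRunner (or otherwise parse generic options) for it to take effect:

    hadoop jar my-batch-job.jar MyJobDriver -libjars extra-lib.jar /user/me/input /user/me/output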
Adding additional tasktrackers will increase the parallelism of your job assuming that there are as-yet-unscheduled map tasks. What it will not do is automatically re-partition the input to parallelize across additional map capacity. So if you have a map capacity of 24 (assuming 6 mappers on each of 4 data nodes), and you have 100 map tasks with the first 24 executing, and you add another data node, you'll get some additional speed. If you have only 12 map tasks, adding machines won't help you.
Finally, you need to be aware of data locality. Since the data should ideally be processed on the same machines that initially store it, adding new TaskTrackers will not necessarily add proportional processing speed, because the data will not be local to those nodes at first and will need to be copied over the network.
I do not quite agree with Daniel's reply, primarily because if "on starting a job, jar code will be copied to all the nodes that the cluster knows of" were true, then even with 100 mappers and 1,000 nodes, the code for every job would always be copied to all of the nodes. That does not make sense.
Instead, Chris Shain's reply makes more sense: whenever the JobScheduler on the JobTracker chooses a job to be executed and identifies a task to be run by a particular data node, at that point it tells the TaskTracker where to copy the codebase from.
Initially (before the MapReduce job starts), the codebase is copied to multiple locations, as defined by the mapred.submit.replication parameter. Hence, a TaskTracker can copy the codebase from any of several locations, a list of which may be sent to it by the JobTracker.
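As a small illustrative sketch of that knob (the property name here is the MRv1 one mentioned above; newer releases use the equivalent mapreduce.client.submit.file.replication, and 10 is only an example value), the hypothetical driver shown earlier could set it on its Configuration before building the Job:

    // In MyJobDriver.main(), before calling Job.getInstance(conf, ...):
    conf.setInt("mapred.submit.replication", 10);  // how widely the job jar and other job files are replicated on HDFS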