
Hadoop Datanode, namenode, secondary-namenode, job-tracker and task-tracker

Tags:

hadoop

I am new to Hadoop, so I have some doubts. If the master node fails, what happens to the Hadoop cluster? Can we recover that node without any loss? Is it possible to keep a secondary master node that switches over automatically when the current one fails?

We have a backup of the NameNode (the Secondary NameNode), so we can restore the NameNode from it when it fails. In the same way, how can we restore the data in a DataNode when that DataNode fails? The Secondary NameNode is a backup of the NameNode only, not of the DataNodes, right? And if a node fails before a job completes, so that the job is pending in the JobTracker, does that job continue on a free node or restart from the beginning?

How can we restore the entire cluster data if anything happens?

And my final question: can we use a C program in MapReduce (for example, bubble sort in MapReduce)?

Thanks in advance

asked Oct 19 '11 by user1002486

People also ask

What is the difference between job tracker and task tracker in Hadoop?

The JobTracker schedules each task close to the data; since every block has multiple replicas, it can usually pick a TaskTracker that holds the data locally. The TaskTracker is the one that actually runs the task on the DataNode: the JobTracker passes the task information to the TaskTracker, and the TaskTracker runs the task.

Can DataNode communicate with task tracker?

The TaskTracker runs on the DataNode, usually on every DataNode. (The TaskTracker is replaced by the NodeManager in MRv2.) It is in constant communication with the JobTracker, signalling the progress of the tasks it is executing.

What is DataNode and Namenode in Hadoop?

The DataNode stores the actual data and works as instructed by the NameNode. A Hadoop file system can have multiple DataNodes but only one active NameNode. The NameNode maintains and manages the DataNodes and assigns tasks to them.

What is task tracker in Hadoop?

A TaskTracker is a node in the cluster that accepts tasks (Map, Reduce and Shuffle operations) from a JobTracker. Every TaskTracker is configured with a set of slots; these indicate the number of tasks it can accept.


2 Answers

Although it is too late to answer your question, it may still help others.

First of all, let me introduce you to the Secondary NameNode:

It keeps a backup of the namespace image and the edit log files for the past hour (configurable). Its job is to merge the latest NameNode namespace image with the edit logs and upload the merged image back to the NameNode as a replacement for the old one. Having a Secondary NameNode in a cluster is not mandatory.
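The one-hour window mentioned above maps to a couple of checkpoint properties; a minimal sketch using the Hadoop 1.x property names, with the default values:

```xml
<!-- core-site.xml (Hadoop 1.x): how often the Secondary NameNode checkpoints -->
<property>
  <name>fs.checkpoint.period</name>
  <value>3600</value> <!-- seconds: checkpoint at most every hour -->
</property>
<property>
  <name>fs.checkpoint.size</name>
  <value>67108864</value> <!-- bytes: force a checkpoint once the edit log reaches 64 MB -->
</property>
```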

Now, coming to your concerns:

  • If the master node fails, what happens to the Hadoop cluster?

Supporting frail's answer: yes, Hadoop has a single point of failure, so any currently running work, a MapReduce job or anything else using the failed master node, will stop. The whole cluster, including clients, will stop working.

  • Can we recover that node without any loss?

Recovering without any loss is unlikely, because all the block reports sent by the DataNodes to the NameNode after the last checkpoint taken by the Secondary NameNode will be lost. I say "unlikely" rather than "impossible" because if the NameNode fails just after a successful checkpoint by the Secondary NameNode, the cluster is in a safe state.

  • Is it possible to keep a secondary master node that switches over automatically when the current one fails?

A manual switch by an administrator is straightforward. To make it automatic, you have to write your own code outside the cluster: code that monitors the cluster, promotes the Secondary NameNode appropriately, and restarts the cluster with the new NameNode address.
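A minimal sketch of such an external monitor, assuming the classic NameNode web UI port 50070; the hostname, the failover hook and the polling policy are all hypothetical:

```python
import time
from urllib.request import urlopen

NAMENODE_URL = "http://namenode:50070"  # hypothetical host; 50070 is the classic NN web UI port

def default_probe(url=NAMENODE_URL, timeout=5):
    # Raises on connection failure or HTTP error; success means the NN web UI answered.
    urlopen(url, timeout=timeout)

def namenode_alive(probe):
    """Return True if the zero-argument probe succeeds, False if it raises."""
    try:
        probe()
        return True
    except Exception:
        return False

def watch(probe, on_failure, checks=3, interval=10.0, sleep=time.sleep):
    """Declare the NameNode dead only after `checks` consecutive failed probes,
    then call the operator-supplied failover hook (promote the secondary,
    restart the cluster with the new NameNode address)."""
    misses = 0
    while misses < checks:
        misses = 0 if namenode_alive(probe) else misses + 1
        sleep(interval)
    on_failure()
```

Requiring several consecutive misses avoids failing over on a single dropped connection; what the failover hook actually does is entirely up to the operator.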

  • We have a backup of the NameNode (the Secondary NameNode), so we can restore the NameNode from it when it fails. In the same way, how can we restore the data in a DataNode when that DataNode fails?

This is about the replication factor. We keep 3 replicas of each file block (the default and best practice; configurable), all on different DataNodes. So when one DataNode fails, we still have 2 replicas for the time being, and the NameNode will later create one more replica of each block the failed DataNode held.
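For reference, the replication factor lives in hdfs-site.xml (Hadoop 1.x property name; 3 is the default):

```xml
<!-- hdfs-site.xml: number of replicas kept for each block -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```

This only affects files written afterwards; the replication of existing files can be changed with `hadoop fs -setrep -w 3 /path`.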

  • The Secondary NameNode is a backup of the NameNode only, not of the DataNodes, right?

Right. It just contains the metadata about the DataNodes: each DataNode's address and properties, including its block report.

  • If a node fails before a job completes, so that the job is pending in the JobTracker, does that job continue on a free node or restart from the beginning?

Hadoop will try hard to continue the job. Again, it depends on the replication factor, rack awareness and other configuration made by the admin, but if Hadoop's HDFS best practices are followed the job will not fail: the JobTracker will get the address of a node holding a replica of the data and continue the task there.
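The retry behaviour is also configurable. With the Hadoop 1.x property names (4 is the default), a failed task attempt is rescheduled on another TaskTracker up to this many times before the whole job is declared failed:

```xml
<!-- mapred-site.xml: attempts per task before the job is declared failed -->
<property>
  <name>mapred.map.max.attempts</name>
  <value>4</value>
</property>
<property>
  <name>mapred.reduce.max.attempts</name>
  <value>4</value>
</property>
```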

  • How can we restore the entire cluster data if anything happens?

By Restarting it.

  • And my final question: can we use a C program in MapReduce (for example, bubble sort in MapReduce)?

Yes, you can use any programming language that supports standard input/output stream operations, via Hadoop Streaming.
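To make the streaming contract concrete, here is a sketch of the classic word count in Python: the mapper writes `key\tvalue` lines, the framework sorts them by key, and the reducer sums consecutive lines with the same key. Any language that can do this over stdin/stdout, including C, works the same way:

```python
from itertools import groupby

def map_lines(lines):
    # Mapper: emit "word\t1" for every word, exactly what Hadoop
    # Streaming would read from the mapper's stdout.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reduce_pairs(pairs):
    # Reducer: Hadoop Streaming delivers mapper output sorted by key,
    # so consecutive lines with the same word can simply be summed.
    keyed = (p.split("\t") for p in pairs)
    for word, group in groupby(keyed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Local smoke test; on a real cluster each function would be a separate
# streaming process reading stdin and writing stdout, with the sort
# done by the framework's shuffle phase.
for line in reduce_pairs(sorted(map_lines(["hello world", "hello hadoop"]))):
    print(line)  # hadoop 1, hello 2, world 1 (tab-separated)
```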

I just gave it a try. Hope it helps you as well as others.

*Suggestions/Improvements are welcome.*

answered Oct 08 '22 by manurajhada


Currently a Hadoop cluster has a single point of failure, which is the NameNode.

And about the secondary node issue (from the Apache wiki):

The term "secondary name-node" is somewhat misleading. It is not a name-node in the sense that data-nodes cannot connect to the secondary name-node, and in no event it can replace the primary name-node in case of its failure.

The only purpose of the secondary name-node is to perform periodic checkpoints. The secondary name-node periodically downloads current name-node image and edits log files, joins them into new image and uploads the new image back to the (primary and the only) name-node. See User Guide.

So if the name-node fails and you can restart it on the same physical node then there is no need to shutdown data-nodes, just the name-node need to be restarted. If you cannot use the old node anymore you will need to copy the latest image somewhere else. The latest image can be found either on the node that used to be the primary before failure if available; or on the secondary name-node. The latter will be the latest checkpoint without subsequent edits logs, that is the most recent name space modifications may be missing there. You will also need to restart the whole cluster in this case.
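For the restart-on-a-new-node case, the Hadoop 1.x procedure looks roughly like this (a sketch; directory layout and script names vary by install):

```shell
# 1. On the replacement NameNode host, make the Secondary NameNode's
#    latest checkpoint available in the directory that fs.checkpoint.dir
#    points at, then import it as the new namespace image:
hadoop namenode -importCheckpoint

# 2. Restart the cluster so the DataNodes register with the new NameNode:
stop-all.sh && start-all.sh
```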

There are tricky ways to overcome this single point of failure. If you are using the Cloudera distribution, one of the ways is explained here. The MapR distribution has a different way to handle this SPOF.

Finally, you can write map reduce in virtually any programming language using Hadoop Streaming.
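As a sketch of what that looks like with a compiled C program, assuming a hypothetical `bubblesort` binary that reads lines on stdin and writes sorted lines on stdout (the jar path and HDFS paths are illustrative):

```shell
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -input   /user/demo/unsorted \
    -output  /user/demo/sorted \
    -mapper  /bin/cat \
    -reducer ./bubblesort \
    -file    ./bubblesort   # ships the binary to every task node
```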

answered Oct 08 '22 by frail