
Why datanode sends the block location information to namenode?

Tags:

hadoop

hdfs

The page https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html says:

the DataNodes are configured with the location of both NameNodes, and send block location information and heartbeats to both.

But why is this information sent to the NameNode and to its standby? I thought this information was already contained in the NameNode's fsimage. The NameNode should know where it put the blocks.

serg asked Dec 11 '15 16:12



People also ask

What information heartbeat carries from Datanode to NameNode?

Heartbeats from a DataNode also carry information such as total storage capacity, the fraction of storage in use, and the number of data transfers currently in progress. The NameNode uses these statistics for its block allocation and load-balancing decisions.

How does NameNode communicate with Datanode?

All communication between Namenode and Datanode is initiated by the Datanode, and responded to by the Namenode. The Namenode never initiates communication to the Datanode, although Namenode responses may include commands to the Datanode that cause it to send further communications.

What are the two messages that NameNode receives from Datanode in Hadoop?

The Namenode periodically receives a heartbeat and a block report from each Datanode in the cluster. By default, every Datanode sends a heartbeat message to the Namenode every 3 seconds.

What messages are transacted between NameNode and Datanode?

DataNodes send information to the NameNode about the files and blocks stored on that node and respond to the NameNode for all filesystem operations.


1 Answer

The Name Node holds the metadata of the entire cluster: the details of each folder and file, replication factors, block names, and so on. The locations of the blocks for each file, however, are not persisted in the fsimage. The Name Node keeps them only in memory and constructs them from the Block Reports sent by the Data Nodes. That is why, in an HA setup, the Data Nodes report to both Name Nodes: the standby needs its own up-to-date block map to take over quickly.
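To illustrate the point, here is a minimal toy model in Python. It is not Hadoop code: the class `ToyNameNode`, its methods, and the node and block names are hypothetical and exist only for this sketch. It shows the idea that the file-to-block mapping can be loaded from persisted state, while the block-to-Data-Node map has to be filled from block reports.

```python
from collections import defaultdict


class ToyNameNode:
    """Hypothetical stand-in for a Name Node; not the real Hadoop class."""

    def __init__(self, namespace):
        # File -> block IDs: the part that is persisted (fsimage + edit log).
        self.namespace = dict(namespace)
        # Block ID -> set of Data Nodes: in memory only, rebuilt from reports.
        self.locations = defaultdict(set)

    def receive_block_report(self, datanode, block_ids):
        # Record that this Data Node holds a replica of each reported block.
        for block_id in block_ids:
            self.locations[block_id].add(datanode)

    def locate(self, path):
        # Return the known replica locations for every block of the file.
        return {b: sorted(self.locations.get(b, ())) for b in self.namespace[path]}


namespace = {"/logs/a.txt": ["blk_1", "blk_2"]}
active = ToyNameNode(namespace)
standby = ToyNameNode(namespace)

# Data Nodes report to both Name Nodes, as in the HA setup from the question.
for nn in (active, standby):
    nn.receive_block_report("dn1", ["blk_1", "blk_2"])
    nn.receive_block_report("dn2", ["blk_1"])

# A Name Node that received no reports has the namespace but no locations.
cold = ToyNameNode(namespace)
```

In this sketch `standby.locate("/logs/a.txt")` can answer immediately after a failover, while `cold` would first have to wait for every Data Node to report.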

Data Nodes store the following information for each block:

  • The actual data stored in the block
  • Metadata for the data stored in the block, mainly checksums for that data.

They periodically send the heart beat and block reports to the Name Node.

Heart Beat:

  • Interval of heart beat reports is determined by configuration parameter dfs.heartbeat.interval (in hdfs-site.xml). By default this is set to 3 seconds.
  • Some of the information contained in the Heart beat is:
    • Registration: Data node registration information
    • Capacity: Total storage capacity available at Data Node
    • dfsUsed: Storage used by HDFS
    • remaining: Remaining storage available for HDFS
    • blockPoolUsed: Storage used by the block pool
    • xmitsInProgress: Number of transfers from this Data Node to others
    • xceiverCount: Number of active transceiver threads
    • cacheCapacity: Total cache capacity available at Data Node
    • cacheUsed: Amount of cache used
  • This information is used by the Name Node in the following ways:
    • Health of the Data Node: Should this data node be marked as dead or alive?
    • Registration of new Data Node: If this is a newly added Data Node, its information is registered
    • Update the metrics of the Data Node: The information sent in the heart beat is used for updating the metrics of the node
    • Issue commands to the Data Node: The Name Node can issue following commands to the Data Node, based on the information received in the heart beat: BlockRecoveryCommand (to recover specified blocks), BlockCommand (for transferring blocks to another Data Node, for invalidating certain blocks), Cache/Uncache (commands for caching / uncaching the blocks)
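The liveness check described above can be sketched roughly as follows. This is a simplified Python illustration, not Hadoop's implementation: `Heartbeat` and `HeartbeatTracker` are hypothetical names, only a few of the fields listed above are modeled, and the expiry rule (twice the recheck interval plus ten heartbeat intervals, with an assumed 300-second recheck interval) is stated here as an assumption about the defaults rather than taken from the answer.

```python
from dataclasses import dataclass

HEARTBEAT_INTERVAL_S = 3   # dfs.heartbeat.interval default, from the text
RECHECK_INTERVAL_S = 300   # assumed default of dfs.namenode.heartbeat.recheck-interval
# Assumed expiry rule: 2 * recheck interval + 10 * heartbeat interval.
EXPIRE_AFTER_S = 2 * RECHECK_INTERVAL_S + 10 * HEARTBEAT_INTERVAL_S


@dataclass
class Heartbeat:
    """A few of the heartbeat fields listed above (hypothetical structure)."""
    datanode: str
    capacity: int
    dfs_used: int
    remaining: int
    xmits_in_progress: int
    xceiver_count: int


class HeartbeatTracker:
    """Hypothetical helper that tracks the last heartbeat per Data Node."""

    def __init__(self):
        self.last_seen = {}  # datanode -> time of last heartbeat (seconds)
        self.stats = {}      # datanode -> most recent Heartbeat

    def receive(self, hb, now):
        # Refresh the liveness timestamp and the node's metrics.
        self.last_seen[hb.datanode] = now
        self.stats[hb.datanode] = hb

    def is_alive(self, datanode, now):
        # A node is alive if it has reported within the expiry window.
        last = self.last_seen.get(datanode)
        return last is not None and now - last <= EXPIRE_AFTER_S


tracker = HeartbeatTracker()
tracker.receive(Heartbeat("dn1", 1000, 400, 600, 0, 2), now=0)
```

With these assumed values a Data Node is considered dead only after several minutes of silence, not after a single missed 3-second heartbeat.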

Block Reports:

  • Interval of block reports is determined by the configuration parameter dfs.blockreport.intervalMsec (in hdfs-site.xml). By default this is set to 21600000 milliseconds (6 hours).
  • Some of the information contained in the block report is:
    • Registration: Data node registration information
    • blocks: Information about the blocks, which contains: block ID, block length, block generation stamp, and the state of the block replica (e.g. whether the replica is finalized or waiting to be recovered)
  • This information is used by the Name Node for:
    • Processing the first block report: If this is the first report from a newly registered Data Node, the Name Node just adds all the valid replicas and ignores the invalid blocks until the next block report.
    • Updating the information about blocks: The (Data Node -> Blocks) map is updated in the Name Node. The new block report is compared with the old report, and information about successful blocks, corrupted blocks, invalidated blocks etc. is updated.
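The comparison of a new block report with the previous one can be sketched like this. It is a deliberately reduced Python illustration: `diff_block_report` is a hypothetical helper, it compares block IDs only, and it ignores the length, generation stamp, and replica state that a real report carries.

```python
from datetime import timedelta

# dfs.blockreport.intervalMsec default from the text, converted for readability.
BLOCK_REPORT_INTERVAL = timedelta(milliseconds=21_600_000)


def diff_block_report(old_blocks, new_blocks):
    """Return (added, removed) block IDs between two reports from one Data Node."""
    old, new = set(old_blocks), set(new_blocks)
    return sorted(new - old), sorted(old - new)


# The previous report listed blk_1 and blk_2; the new one lists blk_2 and blk_3.
added, removed = diff_block_report(["blk_1", "blk_2"], ["blk_2", "blk_3"])
```

Here `added` holds replicas the Name Node should start tracking for this Data Node and `removed` holds replicas it should stop counting there.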
Manjunath Ballur answered Sep 26 '22 12:09
