The HDFS High Availability documentation at https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html says: <blockquote> the DataNodes are configured with the location of both NameNodes, and send block location information and heartbeats to both. </blockquote> But why is this information sent to the NameNode and its standby at all? I thought it was already contained in the NameNode's fsimage. The NameNode should know where it put the blocks.


Why does the DataNode send block location information to the NameNode?

1 Answer

The Name Node holds the metadata of the entire cluster: the details of each folder and file, the replication factor, block names, and so on. It also keeps, in memory, the location of the blocks of each file. This location information is constructed from the block reports sent by the Data Nodes.
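To make the distinction concrete, here is a minimal Python sketch (a toy model, not Hadoop code). The class `ToyNameNode` and its methods are hypothetical names invented for illustration. It assumes the namespace (file to block IDs) is what both Name Nodes share, while the block-to-Data-Node map exists only in memory and is filled in by block reports. A standby that never receives reports knows the files but not where their blocks live.

```python
from collections import defaultdict


class ToyNameNode:
    """Toy model: the namespace is shared/persistent, block locations are not."""

    def __init__(self, namespace):
        # File path -> ordered list of block IDs (namespace metadata).
        self.namespace = dict(namespace)
        # Block ID -> set of Data Node names; held only in memory.
        self.locations = defaultdict(set)

    def process_block_report(self, datanode, block_ids):
        # Record that `datanode` holds every reported block we know about.
        known = {b for blocks in self.namespace.values() for b in blocks}
        for block_id in block_ids:
            if block_id in known:
                self.locations[block_id].add(datanode)

    def locate(self, path):
        # Resolve a file to (block ID, sorted Data Node names) pairs.
        return [(b, sorted(self.locations[b])) for b in self.namespace[path]]


namespace = {"/logs/a.txt": ["blk_1", "blk_2"]}
active = ToyNameNode(namespace)
standby = ToyNameNode(namespace)

reports = {"dn1": ["blk_1", "blk_2"], "dn2": ["blk_1"]}
for dn, blocks in reports.items():
    active.process_block_report(dn, blocks)  # only the active is told

print(active.locate("/logs/a.txt"))   # locations known
print(standby.locate("/logs/a.txt"))  # same files, no locations yet
```

In this model the standby could only serve reads after waiting for fresh block reports, which is why sending reports to both Name Nodes keeps failover fast.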

Data Nodes store the following for each block:

- The actual data stored in the block
- Metadata for that data, mainly its checksums

They periodically send heartbeats and block reports to the Name Node.

Heart Beat:

The heartbeat interval is set by the configuration parameter dfs.heartbeat.interval (in hdfs-site.xml). The default is 3 seconds.
Some of the information contained in the Heart beat is:
- Registration: Data node registration information
- Capacity: Total storage capacity available at Data Node
- dfsUsed: Storage used by HDFS
- remaining: Remaining storage available for HDFS
- blockPoolUsed: Storage used by the block pool
- xmitsInProgress: Number of transfers from this Data Node to others
- xceiverCount: Number of active transceiver threads
- cacheCapacity: Total cache capacity available at Data Node
- cacheUsed: Amount of cache used
This information is used by the Name Node in the following ways:
- Health of the Data Node: Should this data node be marked as dead or alive?
- Registration of new Data Node: If this is a newly added Data Node, its information is registered
- Update the metrics of the Data Node: The information sent in the heart beat is used for updating the metrics of the node
- Issue commands to the Data Node: Based on the information received in the heartbeat, the Name Node can issue the following commands to the Data Node: BlockRecoveryCommand (to recover specified blocks), BlockCommand (to transfer blocks to another Data Node or to invalidate certain blocks), Cache/Uncache (to cache or uncache blocks)
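The liveness decision described above can be sketched roughly as follows. This is an illustration, not the Name Node's actual code: `HeartbeatMonitor` is a hypothetical class, the 5-minute recheck interval is an assumed value, and the expiry rule (2 × recheck interval + 10 × heartbeat interval) is an assumption about how the timeout is commonly derived. Only the 3-second heartbeat default comes from the text.

```python
HEARTBEAT_INTERVAL_S = 3   # dfs.heartbeat.interval default quoted above
RECHECK_INTERVAL_S = 300   # assumed recheck interval (5 minutes)

# Assumed expiry rule: 2 * recheck interval + 10 * heartbeat interval.
EXPIRY_S = 2 * RECHECK_INTERVAL_S + 10 * HEARTBEAT_INTERVAL_S


class HeartbeatMonitor:
    """Tracks the last heartbeat and latest metrics of each Data Node."""

    def __init__(self, expiry_s=EXPIRY_S):
        self.expiry_s = expiry_s
        self.last_seen = {}  # Data Node name -> time of last heartbeat (s)
        self.metrics = {}    # Data Node name -> latest reported metrics

    def heartbeat(self, datanode, now, capacity, dfs_used, remaining):
        # A first heartbeat doubles as registration in this toy model.
        self.last_seen[datanode] = now
        self.metrics[datanode] = {
            "capacity": capacity,
            "dfsUsed": dfs_used,
            "remaining": remaining,
        }

    def is_alive(self, datanode, now):
        # Alive if a heartbeat arrived within the expiry window.
        last = self.last_seen.get(datanode)
        return last is not None and now - last <= self.expiry_s


monitor = HeartbeatMonitor()
monitor.heartbeat("dn1", now=0, capacity=1000, dfs_used=200, remaining=800)
print(EXPIRY_S, monitor.is_alive("dn1", now=60), monitor.is_alive("dn1", now=700))
```

Under these assumed numbers a Data Node is declared dead only after many missed heartbeats, so a brief network hiccup does not trigger re-replication.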

Block Reports:

The block report interval is set by the configuration parameter dfs.blockreport.intervalMsec (in hdfs-site.xml). The default is 21600000 milliseconds (6 hours).
Some of the information contained in the block report is:
- Registration: Data node registration information
- blocks: Information about the blocks, which contains: block ID, block length, block generation stamp, and state of the block replica (e.g. whether the replica is finalized or waiting to be recovered)
This information is used by the Name Node for:
- Processing the first block report: If this is the first report from a newly registered Data Node, the Name Node simply adds all the valid replicas and ignores the invalid blocks until the next block report.
- Updating the information about blocks: The (Data Node -> Blocks) map is updated in the Name Node. The new block report is compared with the old one, and the information about successful blocks, corrupted blocks, invalidated blocks, etc. is updated.
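The "compare the new report with the old one" step can be sketched as a set difference. This is a simplified illustration: the helper names `diff_block_report` and `apply_block_report` are hypothetical, and real block reports also carry lengths, generation stamps and replica states, which are omitted here.

```python
def diff_block_report(stored, reported):
    """Return (to_add, to_remove) block IDs for one Data Node."""
    stored, reported = set(stored), set(reported)
    return sorted(reported - stored), sorted(stored - reported)


def apply_block_report(datanode_blocks, datanode, reported):
    # Update the (Data Node -> Blocks) map and report what changed.
    to_add, to_remove = diff_block_report(
        datanode_blocks.get(datanode, ()), reported
    )
    datanode_blocks[datanode] = set(reported)
    return to_add, to_remove


datanode_blocks = {}
# First report from a newly registered Data Node: everything is new.
first = apply_block_report(datanode_blocks, "dn1", ["blk_1", "blk_2"])
# Later report: blk_1 has disappeared, blk_3 has been written.
second = apply_block_report(datanode_blocks, "dn1", ["blk_2", "blk_3"])
print(first, second)

# The default interval quoted above, converted to hours.
BLOCK_REPORT_INTERVAL_MS = 21_600_000
hours = BLOCK_REPORT_INTERVAL_MS / (1000 * 60 * 60)
print(hours)
```

Because the map is rebuilt purely from what Data Nodes report, a Name Node that has not received reports has no block locations to serve, regardless of what its namespace metadata says.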

answered Sep 26 '22 12:09

Manjunath Ballur


Tags:

hadoop

hdfs

serg
