While reading the book Hadoop: The Definitive Guide, I came across this page with the following line:
The namenode also knows the datanodes on which all the blocks for a given file are located, however, it does not store block locations persistently, since this information is reconstructed from datanodes when the system starts.
I am struggling to understand how this works. Let's say that I copy a 1 GB file to an 8-node cluster with a replication factor of 3. So each datanode will have 1 block, and these blocks will be replicated on other nodes, bringing the total number of blocks on each node effectively to 3. Now the namenode is supposed to keep an index containing the location of each block. But according to the text, the namenode does not store block locations persistently, so how are they reconstructed after the cluster is shut down and restarted? There will be no way of telling which block belongs to which file. Can someone please explain this to me?
The namenode does preserve some state about the files (name, path, size, block size, block IDs, etc.), just not the physical location of where the blocks are.
When the data nodes start up, each one effectively tree walks the dfs data directory, discovering all the file blocks it has, and once complete, reports to the name node the blocks that it hosts.
The namenode builds up a map of the files to block locations from the reports from each data node.
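To make the reconstruction concrete, here is a minimal sketch in Python. It is not how HDFS is implemented; the names `namespace`, `build_block_map`, `locate_file`, the block IDs and the datanode names are all made up for illustration. The point is only that the persisted file-to-block-ID list plus the in-memory block reports are enough to answer "where is each block of this file?".

```python
from collections import defaultdict

# Hypothetical, simplified stand-in for what the namenode persists:
# file path -> ordered list of block IDs (no locations).
namespace = {
    "/data/big.bin": ["blk_1", "blk_2", "blk_3"],
}


def build_block_map(block_reports):
    """Combine per-datanode block reports into block ID -> set of datanodes."""
    locations = defaultdict(set)
    for datanode, block_ids in block_reports.items():
        for block_id in block_ids:
            locations[block_id].add(datanode)
    return locations


def locate_file(path, namespace, locations):
    """Return [(block ID, sorted datanodes holding it)] in file order."""
    return [(b, sorted(locations.get(b, ()))) for b in namespace[path]]


# Made-up block reports, as if sent by three datanodes after a restart.
reports = {
    "dn1": ["blk_1", "blk_2"],
    "dn2": ["blk_2", "blk_3"],
    "dn3": ["blk_1", "blk_3"],
}

block_map = build_block_map(reports)
file_locations = locate_file("/data/big.bin", namespace, block_map)
for block_id, datanodes in file_locations:
    print(block_id, datanodes)
```

Nothing about locations is read from disk on the namenode side here: `block_map` exists only in memory and is rebuilt from whatever the datanodes report.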
This is one of the reasons it sometimes takes a few minutes to come out of safe mode when the cluster first starts up - if you have lots of files, it can take a while for each data node to tree walk and discover the blocks it hosts.
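The safe mode wait can be pictured roughly like this. The sketch below is an assumption-laden simplification: `can_leave_safe_mode` is a hypothetical function, and its default values are meant to mirror my understanding of the HDFS settings `dfs.namenode.safemode.threshold-pct` (0.999) and `dfs.namenode.replication.min` (1), so check them against your own cluster configuration.

```python
def can_leave_safe_mode(expected_blocks, reported_replicas,
                        min_replicas=1, threshold=0.999):
    """Return True once enough known blocks have been reported by datanodes.

    expected_blocks: block IDs the namenode knows from its persisted metadata.
    reported_replicas: block ID -> number of replicas reported so far.
    min_replicas, threshold: assumed defaults, see the note above.
    """
    if not expected_blocks:
        return True  # nothing to wait for on an empty filesystem
    safe = sum(1 for b in expected_blocks
               if reported_replicas.get(b, 0) >= min_replicas)
    return safe / len(expected_blocks) >= threshold


blocks = ["blk_1", "blk_2", "blk_3", "blk_4"]

# Only some datanodes have reported: half the blocks are still unaccounted for.
early = can_leave_safe_mode(blocks, {"blk_1": 1, "blk_2": 2})

# Every block now has at least one reported replica.
late = can_leave_safe_mode(blocks, {"blk_1": 1, "blk_2": 2, "blk_3": 1, "blk_4": 3})
```

Until the slower data nodes finish scanning and reporting, the fraction of accounted-for blocks stays below the threshold, which is why a cluster with many files can sit in safe mode for a while.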