After manually rebalancing hadoop hdfs disks DataNode won't restart

Tags:

hadoop

I use Hadoop hadoop-2.0.0-mr1-cdh4.1.2 on a cluster of 40 machines. Each machine has 12 disks used by Hadoop. The disks on one machine were unbalanced, so I decided to rebalance them manually as described in this post: rebalance individual datanode in hadoop. I stopped the DataNode on that server, moved block file pairs, and moved whole subdirectories between some of the disks, roughly as in the sketch below.
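For reference, here is a minimal sketch of what I did, not an authoritative procedure: the service name assumes a CDH4 package install, the block pool ID is taken from the logs further down, and the source/target disks and subdirectories are hypothetical examples.

# Stop the DN before touching any blocks (assumed CDH4 service name).
sudo service hadoop-hdfs-datanode stop

BP=BP-208475052-10.165.18.36-1351280731538              # block pool ID from the DN logs
SRC=/data/disk3/dfs/data/current/$BP/current/finalized  # hypothetical source disk
DST=/data/disk9/dfs/data/current/$BP/current/finalized  # hypothetical target disk

# Move a whole subdirectory; blk_<id> files and their blk_<id>_<genstamp>.meta
# pairs travel together this way. (As the answer below shows, do this as the
# hdfs user, or fix ownership afterwards.)
sudo mkdir -p "$DST/subdir61"
sudo mv "$SRC/subdir61/subdir28" "$DST/subdir61/"

sudo service hadoop-hdfs-datanode start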

As soon as I stopped the DataNode, the NameNode started complaining about missing blocks, displaying the following message in its UI: WARNING : There are 2002 missing blocks. Please check the logs or run fsck in order to identify the missing blocks.
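Running fsck as the warning suggests does list the affected files; a sketch, assuming a node with the HDFS client configuration in place:

hdfs fsck / -list-corruptfileblocks        # just the files with missing/corrupt blocks

hdfs fsck / -files -blocks -locations      # verbose on a large cluster: every file with its blocks and replica locations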

Then I tried to restart the DataNode. It refused to start successfully and kept logging errors and warnings like the following:

java.io.IOException: Invalid directory or I/O error occurred for dir: /data/disk3/dfs/data/current/BP-208475052-10.165.18.36-1351280731538/current/finalized/subdir61/subdir28

2013-12-20 01:40:29,046 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in offerService java.io.IOException: block pool BP-208475052-10.165.18.36-1351280731538 is not found

2013-12-20 01:40:29,088 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in BPOfferService for Block pool BP-208475052-10.165.18.36-1351280731538 (storage id DS-737580588-10.165.18.36-50010-1351280778276) service to aspen8hdp19.turner.com/10.165.18.56:54310 java.lang.NullPointerException

2013-12-20 01:40:34,088 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in offerService java.io.IOException: block pool BP-208475052-10.165.18.36-1351280731538 is not found
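The first IOException names a concrete directory, so a natural first step is to inspect it directly; a sketch using the path from that log line:

ls -ld /data/disk3/dfs/data/current/BP-208475052-10.165.18.36-1351280731538/current/finalized/subdir61/subdir28

ls -l /data/disk3/dfs/data/current/BP-208475052-10.165.18.36-1351280731538/current/finalized/subdir61/subdir28 | head

The first command shows the directory's owner, group, and mode; the second samples the block/meta files inside it. (As the answer below shows, this is exactly where the problem was visible.)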

So, I have some questions:

  • Isn't it enough to follow the approach I described, i.e. stop the DataNode, move block file pairs and/or subdirectories, and restart the DataNode?
  • Do I need to restart NameNode or other services?
  • Why does it complain about missing blocks or corrupt files?
  • How can I restart the DataNode and get rid of those exceptions, so that the DN communicates successfully with the NN?

I appreciate your help. Eduardo.


1 Answer

I'm going to answer my own question here.

The problem was caused by wrong file/directory permissions and ownership after I moved the data blocks. I did the move as root, and the moved files ended up with the following permissions:

drwx-----T 2 root root 12288 Dec 19 23:14 subdir28

Once I changed the ownership and permissions back to the original, the DN restarted properly and the NN stopped reporting missing blocks or corrupt files. Here are the permissions a subdir should have (a fix sketch follows the listing):

drwxr-xr-t 2 hdfs hadoop 12288 Dec 20 11:47 subdir28
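Here is a sketch of the fix, assuming the hdfs:hadoop ownership and the drwxr-xr-t mode (octal 1755) shown above; match the exact owner, group, and modes against an untouched sibling directory on the same machine before applying anything recursively.

BP=BP-208475052-10.165.18.36-1351280731538   # block pool ID from the DN logs

# Restore ownership and directory mode on everything the root-run move touched.
for d in /data/disk*/dfs/data/current/$BP/current/finalized; do
  sudo chown -R hdfs:hadoop "$d"
  sudo find "$d" -type d -exec chmod 1755 {} +   # drwxr-xr-t, as in the good listing
  # Block files are typically rw-r--r--; verify against an intact disk first:
  # sudo find "$d" -type f -exec chmod 644 {} +
done

sudo service hadoop-hdfs-datanode start   # assumed CDH4 service name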
