I've set up a distributed Hadoop environment within VirtualBox: 4 virtual Ubuntu 11.10 installations, one acting as the master node and the other three as slaves. I followed this tutorial to get the single-node version up and running and then converted it to the fully-distributed version. It was working just fine under 11.04; however, when I upgraded to 11.10, it broke. Now all my slaves' logs show the following error message, repeated ad nauseam:
INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.1.10:54310. Already tried 0 time(s).
INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.1.10:54310. Already tried 1 time(s).
INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.1.10:54310. Already tried 2 time(s).
And so on. I've found other instances of this error message on the Internet (and on StackOverflow), but none of the solutions have worked: I tried changing the core-site.xml and mapred-site.xml entries to use the IP address rather than the hostname; I quadruple-checked /etc/hosts on all the slaves and the master; and the master can SSH password-less into all the slaves. I even tried reverting each slave back to a single-node setup, and they all work fine in that case (on that note, the master always works fine as both a DataNode and the NameNode).
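For anyone retracing those checks, the commands were along these lines (hostnames match the master/slaveX naming shown below):

getent hosts master                              # which address does 'master' resolve to on this box?
ssh slave1 'hostname && getent hosts master'     # confirms password-less SSH and the slave's view of 'master'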
The only symptom I've found that would seem to give a lead is that from any of the slaves, when I attempt telnet 192.168.1.10 54310, I get Connection refused, suggesting there is some rule blocking access (which must have gone into effect when I upgraded to 11.10).
My /etc/hosts.allow has not changed, however. I tried the rule ALL: 192.168.1. but it did not change the behavior.
Oh yes, and netstat on the master clearly shows TCP ports 54310 and 54311 are listening.
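For reference, the check was along the lines of the following; the Local Address column turned out to be the detail that matters:

sudo netstat -tlnp | grep -E '5431[01]'
# If the Local Address shows 127.0.1.1:54310 (or 127.0.0.1:54310) rather than
# 192.168.1.10:54310, remote slaves will get Connection refused.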
Does anyone have any suggestions on how to get the slave DataNodes to recognize the NameNode?
EDIT #1: In doing some poking around with nmap (see comments on this post), I'm thinking the issue is in my /etc/hosts files. This is what is listed for the master VM:
127.0.0.1 localhost
127.0.1.1 master
192.168.1.10 master
192.168.1.11 slave1
192.168.1.12 slave2
192.168.1.13 slave3
For each slave VM:
127.0.0.1 localhost
127.0.1.1 slaveX
192.168.1.10 master
192.168.1.1X slaveX
Unfortunately, I'm not sure what I changed, but the NameNode now always dies with an exception about trying to bind a port that is "already in use" (127.0.1.1:54310). I'm clearly doing something wrong with the hostnames and IP addresses, but I'm really not sure what it is. Thoughts?
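A quick way to see what is already holding that port (assuming standard Ubuntu tooling and a JDK that provides jps):

sudo netstat -tlnp | grep 54310    # shows the PID/program bound to the port
jps                                # lists the running Hadoop Java daemons (NameNode, DataNode, ...)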
A block report from a particular DataNode contains information about all the blocks that reside on that DataNode. When the NameNode doesn't receive any heartbeat message from a particular DataNode for 10 minutes (by default), that DataNode is considered dead or failed by the NameNode.
Data blocks on the failed DataNode are replicated onto other DataNodes according to the replication factor specified in the hdfs-site.xml file. Once the failed DataNode comes back, the NameNode manages the replication factor again (removing any now-excess replicas). This is how the NameNode handles the failure of a DataNode.
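For context, the replication factor mentioned above is the dfs.replication property in hdfs-site.xml (the value 3 here is just the common default, not something specific to this cluster):

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>   <!-- each HDFS block is kept on 3 DataNodes -->
  </property>
</configuration>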
If you want to get the active NameNode hostname from the hdfs-site.xml file, you can use the following Python script on GitHub – https://github.com/grakala/getActiveNN.
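The core of that approach is just parsing the Hadoop XML config for the relevant property. A minimal sketch (the property name and file path here are illustrative assumptions; non-HA setups keep the address in core-site.xml under fs.default.name instead):

import xml.etree.ElementTree as ET

def get_property(conf_path, prop_name):
    # Hadoop config files are flat <configuration><property><name/><value/></property> XML
    root = ET.parse(conf_path).getroot()
    for prop in root.findall('property'):
        if prop.findtext('name') == prop_name:
            return prop.findtext('value')
    return None

# Illustrative usage: look up the NameNode RPC address
print(get_property('/etc/hadoop/conf/hdfs-site.xml', 'dfs.namenode.rpc-address'))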
To add a new node, the DataNode daemon should be started manually on it using the $HADOOP_HOME/bin/hadoop-daemon.sh script; the master (NameNode) does not need to be reconfigured, since the new DataNode contacts it automatically and joins the cluster. The new node should also be added to the slaves file in the master's configuration directory so that the cluster-wide start/stop scripts pick it up.
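Concretely, the steps look something like this (paths assume the standard Hadoop 1.x layout, and slave4 is a hypothetical new node):

# on the master: register the new node so bin/start-dfs.sh knows about it
echo slave4 >> $HADOOP_HOME/conf/slaves

# on the new node: start the DataNode daemon by hand
$HADOOP_HOME/bin/hadoop-daemon.sh start datanode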
I found it! By commenting out the second line of the /etc/hosts file (the one with the 127.0.1.1 entry), netstat shows the NameNode ports binding to the 192.168.1.10 address instead of the local one, and the slave VMs found it. Ahhhhhhhh. Mystery solved! Thanks for everyone's help.
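For anyone who lands here with the same symptoms, the working master /etc/hosts ends up looking like this (the 127.0.1.1 line that Ubuntu generates is the one commented out):

127.0.0.1 localhost
# 127.0.1.1 master    <- commented out, so 'master' now resolves to the real interface
192.168.1.10 master
192.168.1.11 slave1
192.168.1.12 slave2
192.168.1.13 slave3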