Hadoop Datanodes cannot find NameNode

I've set up a distributed Hadoop environment within VirtualBox: four virtual Ubuntu 11.10 installations, one acting as the master node and the other three as slaves. I followed this tutorial to get the single-node version up and running and then converted it to the fully distributed version. It was working just fine when I was running 11.04; however, when I upgraded to 11.10, it broke. Now all my slaves' logs show the following error message, repeated ad nauseam:

    INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.1.10:54310. Already tried 0 time(s).
    INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.1.10:54310. Already tried 1 time(s).
    INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.1.10:54310. Already tried 2 time(s).

And so on. I've found other instances of this error message on the Internet (and Stack Overflow), but none of the solutions have worked. I tried changing the core-site.xml and mapred-site.xml entries to use the IP address rather than the hostname, I quadruple-checked /etc/hosts on all slaves and the master, and the master can SSH password-less into all slaves. I even tried reverting each slave back to a single-node setup, and they all work fine in that case (on that note, the master always works fine as both a DataNode and the NameNode).

The only symptom I've found that would seem to give a lead is that from any of the slaves, when I attempt a telnet 192.168.1.10 54310, I get Connection refused, suggesting there is some rule blocking access (which must have gone into effect when I upgraded to 11.10).
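For anyone who wants to script the telnet check described above, here is a minimal sketch in Python. The function name `probe` and the default timeout are my own choices, not part of any Hadoop tooling. It only distinguishes a refused connection (the remote host answered, but nothing accepted the connection on that address and port, or a firewall actively rejected it) from a timeout (packets silently dropped).

```python
import socket


def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt one TCP connection and classify the outcome."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The connection attempt was actively rejected.
        return "refused"
    except socket.timeout:
        # No reply at all within the timeout.
        return "timeout"
    except OSError as exc:
        # Anything else, for example "no route to host".
        return f"error: {exc}"
```

Run from a slave, `probe("192.168.1.10", 54310)` would be the equivalent of the telnet attempt. A result of "refused" does not by itself prove a blocking rule: it is also what you get when the service is listening on a different address of the same machine.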

My /etc/hosts.allow has not changed, however. I tried the rule ALL: 192.168.1., but it did not change the behavior.

Oh yes, and netstat on the master clearly shows tcp ports 54310 and 54311 are listening.

Does anyone have any suggestions for getting the slave DataNodes to recognize the NameNode?

EDIT #1: In doing some poking around with nmap (see comments on this post), I'm thinking the issue is in my /etc/hosts files. This is what is listed for the master VM:

    127.0.0.1    localhost
    127.0.1.1    master
    192.168.1.10 master
    192.168.1.11 slave1
    192.168.1.12 slave2
    192.168.1.13 slave3

For each slave VM:

    127.0.0.1    localhost
    127.0.1.1    slaveX
    192.168.1.10 master
    192.168.1.1X slaveX

Unfortunately, I'm not sure what I changed, but the NameNode now always dies with an exception about trying to bind to an address "that's already in use" (127.0.1.1:54310). I'm clearly doing something wrong with the hostnames and IP addresses, but I'm really not sure what it is. Thoughts?
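A quick way to inspect hosts files like the ones listed above is to look for any name that is mapped to both a loopback address and a LAN address. The following is a rough sketch, not an official tool: the helper name `loopback_conflicts` is hypothetical, and it assumes the usual hosts format of an address followed by one or more names. Which entry wins when a name appears twice depends on the resolver configuration, so treat a reported conflict as something to investigate rather than a definite fault.

```python
import ipaddress
from collections import defaultdict


def loopback_conflicts(hosts_text: str) -> dict[str, list[str]]:
    """Return names mapped to both a loopback and a non-loopback address."""
    addrs_by_name = defaultdict(list)
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        addr, *names = line.split()
        try:
            ipaddress.ip_address(addr)
        except ValueError:
            continue  # skip lines whose first field is not a valid address
        for name in names:
            addrs_by_name[name].append(addr)

    conflicts = {}
    for name, addrs in addrs_by_name.items():
        flags = [ipaddress.ip_address(a).is_loopback for a in addrs]
        if any(flags) and not all(flags):
            conflicts[name] = addrs
    return conflicts


# The master's hosts file as listed in the question.
MASTER_HOSTS = """\
127.0.0.1    localhost
127.0.1.1    master
192.168.1.10 master
192.168.1.11 slave1
192.168.1.12 slave2
192.168.1.13 slave3
"""

print(loopback_conflicts(MASTER_HOSTS))
# prints {'master': ['127.0.1.1', '192.168.1.10']}
```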

asked Jan 15 '12 19:01

Magsol

1 Answer

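I found it! With the 127.0.1.1 entry in place, the hostname master resolved to a loopback address on the master itself, so the NameNode was listening only on 127.0.1.1:54310, which the slaves cannot reach. By commenting out the second line of the /etc/hosts file (the one with the 127.0.1.1 entry), netstat shows the NameNode ports binding to the 192.168.1.10 address instead of the local one, and the slave VMs found it. Ahhhhhhhh. Mystery solved! Thanks for everyone's help.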
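To confirm the fix without reading netstat output, you can check what a hostname resolves to on the master. This is a small sketch using only the Python standard library; the function name `resolves_to_loopback` is made up for illustration, and it looks only at the single IPv4 address returned by `socket.gethostbyname`.

```python
import ipaddress
import socket


def resolves_to_loopback(hostname: str) -> bool:
    """Return True if the hostname resolves to an address in 127.0.0.0/8."""
    addr = socket.gethostbyname(hostname)  # IPv4 lookup; consults /etc/hosts on typical setups
    return ipaddress.ip_address(addr).is_loopback
```

On the master, `resolves_to_loopback(socket.gethostname())` should presumably return False after the 127.0.1.1 line is commented out. If it still returns True, a service that binds to the address of its own hostname would only be reachable from the local machine.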

answered Oct 25 '22 21:10

Magsol
