I have 3 DataNodes and 1 NameNode on one machine, inside LXC containers. The DataNode on the same host as the NameNode works fine, but the other 2 don't; I get this in the log file:
Initialization failed for Block pool BP-232943349-10.0.3.112-1417116665984
(Datanode Uuid null) service to hadoop12.domain.local/10.0.3.112:8022
Datanode denied communication with namenode because hostname cannot be resolved
(ip=10.0.3.233, hostname=10.0.3.233): DatanodeRegistration(10.0.3.114,
datanodeUuid=49a6dc47-c988-4cb8-bd84-9fabf87807bf, infoPort=50075, ipcPort=50020,
storageInfo=lv=-56;cid=cluster24;nsid=11020533;c=0)
Note that my NameNode is at 10.0.3.112, and the failing DataNode in this case is at 10.0.3.114.
All the nodes' FQDNs are defined in the hosts file on every node, and I can ping each node from all the others.
What puzzles me here is that the DataNode seems to be trying to reach the NameNode at 10.0.3.233, which is NOT an IP in the list, nor the IP of the NameNode. Why? Where is this configured? The second DataNode that fails is at 10.0.3.113, and it also looks for a different IP (10.0.3.158) that it can't resolve, because that address is not defined and does not exist in my setup.
The node that works is at 10.0.3.112, the same as the NameNode, yet in its log I see src and dest addresses with IPs outside the range I use, like this:
src: /10.0.3.112:50010, dest: /10.0.3.180:53246, bytes: 60, op: HDFS_READ,
cliID: DFSClient_NONMAPREDUCE_-939581249_2253, offset: 0, srvID: a83af9ba-4e1a-47b3-a5d4-
f437ef60c287, blockid: BP-232943349-10.0.3.112-1417116665984:blk_1073742468_1644,
duration: 1685666
So what exactly is going on here, and how come I can't reach the NameNode when all my nodes can see and resolve each other?
Thanks for the help.
PS: the /etc/hosts file looks like this:
127.0.0.1 localhost
# The following lines are desirable for IPv6 capable hosts
::1 localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
10.0.3.1 bigdata.domain.local
192.168.10.33 bigdata.domain.local
10.0.3.111 hadoop11.domain.local
10.0.3.112 hadoop12.domain.local
10.0.3.113 hadoop13.domain.local
10.0.3.114 hadoop14.domain.local
10.0.3.115 hadoop15.domain.local
10.0.3.116 hadoop16.domain.local
10.0.3.117 hadoop17.domain.local
10.0.3.118 hadoop18.domain.local
10.0.3.119 hadoop19.domain.local
10.0.3.121 hadoop21.domain.local
10.0.3.122 hadoop22.domain.local
10.0.3.123 hadoop23.domain.local
10.0.3.124 hadoop24.domain.local
10.0.3.125 hadoop25.domain.local
10.0.3.126 hadoop26.domain.local
10.0.3.127 hadoop27.domain.local
10.0.3.128 hadoop28.domain.local
10.0.3.129 hadoop29.domain.local
core-site.xml:
<?xml version="1.0" encoding="UTF-8"?>
<!--Autogenerated by Cloudera Manager-->
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://nameservice1</value>
</property>
<property>
<name>fs.trash.interval</name>
<value>1</value>
</property>
<property>
<name>io.compression.codecs</name>
<value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.DeflateCodec,org.apache.hadoop.io.compress.SnappyCodec,org.apache.hadoop.io.compress.Lz4Codec</value>
</property>
<property>
<name>hadoop.security.authentication</name>
<value>simple</value>
</property>
<property>
<name>hadoop.security.authorization</name>
<value>false</value>
</property>
<property>
<name>hadoop.rpc.protection</name>
<value>authentication</value>
</property>
<property>
<name>hadoop.ssl.require.client.cert</name>
<value>false</value>
<final>true</final>
</property>
<property>
<name>hadoop.ssl.keystores.factory.class</name>
<value>org.apache.hadoop.security.ssl.FileBasedKeyStoresFactory</value>
<final>true</final>
</property>
<property>
<name>hadoop.ssl.server.conf</name>
<value>ssl-server.xml</value>
<final>true</final>
</property>
<property>
<name>hadoop.ssl.client.conf</name>
<value>ssl-client.xml</value>
<final>true</final>
</property>
<property>
<name>hadoop.security.auth_to_local</name>
<value>DEFAULT</value>
</property>
<property>
<name>hadoop.proxyuser.oozie.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.oozie.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.mapred.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.mapred.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.flume.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.flume.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.HTTP.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.HTTP.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hive.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hive.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hue.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hue.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.httpfs.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.httpfs.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hdfs.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hdfs.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.security.group.mapping</name>
<value>org.apache.hadoop.security.ShellBasedUnixGroupsMapping</value>
</property>
<property>
<name>hadoop.security.instrumentation.requires.admin</name>
<value>false</value>
</property>
</configuration>
hdfs-site.xml:
<!--Autogenerated by Cloudera Manager-->
<configuration>
<property>
<name>dfs.nameservices</name>
<value>nameservice1</value>
</property>
<property>
<name>dfs.client.failover.proxy.provider.nameservice1</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
<name>dfs.ha.automatic-failover.enabled.nameservice1</name>
<value>true</value>
</property>
<property>
<name>ha.zookeeper.quorum</name>
<value>hadoop12.domain.local:2181,hadoop13.domain.local:2181,hadoop14.domain.local:2181</value>
</property>
<property>
<name>dfs.ha.namenodes.nameservice1</name>
<value>namenode114,namenode137</value>
</property>
<property>
<name>dfs.namenode.rpc-address.nameservice1.namenode114</name>
<value>hadoop12.domain.local:8020</value>
</property>
<property>
<name>dfs.namenode.servicerpc-address.nameservice1.namenode114</name>
<value>hadoop12.domain.local:8022</value>
</property>
<property>
<name>dfs.namenode.http-address.nameservice1.namenode114</name>
<value>hadoop12.domain.local:50070</value>
</property>
<property>
<name>dfs.namenode.https-address.nameservice1.namenode114</name>
<value>hadoop12.domain.local:50470</value>
</property>
<property>
<name>dfs.namenode.rpc-address.nameservice1.namenode137</name>
<value>hadoop14.domain.local:8020</value>
</property>
<property>
<name>dfs.namenode.servicerpc-address.nameservice1.namenode137</name>
<value>hadoop14.domain.local:8022</value>
</property>
<property>
<name>dfs.namenode.http-address.nameservice1.namenode137</name>
<value>hadoop14.domain.local:50070</value>
</property>
<property>
<name>dfs.namenode.https-address.nameservice1.namenode137</name>
<value>hadoop14.domain.local:50470</value>
</property>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.blocksize</name>
<value>134217728</value>
</property>
<property>
<name>dfs.client.use.datanode.hostname</name>
<value>true</value>
</property>
<property>
<name>fs.permissions.umask-mode</name>
<value>022</value>
</property>
<property>
<name>dfs.namenode.acls.enabled</name>
<value>false</value>
</property>
<property>
<name>dfs.client.read.shortcircuit</name>
<value>false</value>
</property>
<property>
<name>dfs.domain.socket.path</name>
<value>/var/run/hdfs-sockets/dn</value>
</property>
<property>
<name>dfs.client.read.shortcircuit.skip.checksum</name>
<value>false</value>
</property>
<property>
<name>dfs.client.domain.socket.data.traffic</name>
<value>false</value>
</property>
<property>
<name>dfs.datanode.hdfs-blocks-metadata.enabled</name>
<value>true</value>
</property>
</configuration>
You can just change the hdfs-site.xml configuration on the NameNode.
Notice the dfs.namenode.datanode.registration.ip-hostname-check property.
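Presumably this means setting that check to false; a minimal hdfs-site.xml excerpt on the NameNode would look something like the sketch below (not taken from the poster's config). Note that disabling the check only hides the symptom: it lets DataNodes register even when the NameNode cannot reverse-resolve their IPs.
<!-- Sketch: relax the NameNode's reverse-DNS check at DataNode registration (default is true) -->
<property>
  <name>dfs.namenode.datanode.registration.ip-hostname-check</name>
  <value>false</value>
</property>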
After a lot of issues with this setup, I finally figured out what was wrong... Even though my config was right when I set it up, it turns out that resolvconf (the program) tends to regenerate the /etc/resolv.conf configuration file and overwrite my setting for search domain.local.
It also happens that Cloudera and Hadoop use various ways to determine the IP address, and unfortunately they are not consistent: Cloudera first looks up the IP over SSH, which, like ping and other programs, uses the GLIBC resolver; but later it uses host, which bypasses the GLIBC resolver (and therefore /etc/hosts) and queries DNS directly, based on /etc/resolv.conf.
So, at first it would work fine, but resolvconf would automatically override my domain and search settings and mess things up.
I ended up REMOVING resolvconf from my setup, and with the proper files in place (hosts, resolv.conf) and making sure host resolves to the FQDN, it's all good. So the trick was to remove resolvconf, which has been installed by default since Ubuntu 10.04, I believe. This is of course true of a local setup like mine; on an actual cluster on a network with DNS, just make sure the DNS resolves the nodes properly.
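To see the difference between the two resolution paths described above, a quick sanity check (just a sketch; the hostname and IP are the failing DataNode from the question) is to compare what the GLIBC/NSS resolver returns with what direct DNS returns, and to confirm the reverse lookup of the DataNode's IP comes back as the expected FQDN:
# GLIBC/NSS path (reads /etc/hosts per nsswitch.conf) - what ping and ssh use
getent hosts hadoop14.domain.local
getent hosts 10.0.3.114
# direct DNS path (skips /etc/hosts, uses the servers in /etc/resolv.conf) - what host uses
host hadoop14.domain.local
host 10.0.3.114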