With the HDFS or HFTP URI scheme (e.g. hdfs://namenode/path/to/file) I can access HDFS clusters without needing their XML configuration files. This is very handy when running shell commands like hdfs dfs -get or hadoop distcp, or when reading files from Spark with sc.hadoopFile(), because I don't have to copy and manage XML files for every relevant HDFS cluster on every node where that code might run.
One drawback of this approach is that I have to use the active NameNode's hostname, otherwise Hadoop will throw an exception complaining that the NN is standby.
A usual workaround is to try one NameNode and fall back to the other if an exception is caught, or to connect to ZooKeeper directly and parse its binary data with protobuf.
Both of these methods are cumbersome compared to (for example) MySQL's load-balancing URI or ZooKeeper's connection string, where I can just comma-separate all the hosts in the URI and the driver automatically finds a node to talk to.
Say I have active and standby NameNode hosts nn1 and nn2. What is the simplest way to refer to a specific HDFS path so that it works regardless of which NameNode is currently active?
Tags: hdfs, hadoop
In this scenario, instead of checking for the active NameNode host and port combination, we should use a nameservice: the nameservice automatically routes client requests to the active NameNode.
The nameservice acts like a proxy in front of the NameNodes and always directs HDFS requests to the active one.
Example: hdfs://nameservice_id/file/path/in/hdfs
In the hdfs-site.xml file, create a nameservice by giving it an ID (here the nameservice_id is mycluster):
<property>
<name>dfs.nameservices</name>
<value>mycluster</value>
<description>Logical name for this new nameservice</description>
</property>
Now specify the NameNode IDs that identify the NameNodes in the cluster, using dfs.ha.namenodes.[$nameservice ID]:
<property>
<name>dfs.ha.namenodes.mycluster</name>
<value>nn1,nn2</value>
<description>Unique identifiers for each NameNode in the nameservice</description>
</property>
Then link the NameNode IDs to the NameNode hosts, using dfs.namenode.rpc-address.[$nameservice ID].[$namenode ID]:
<property>
<name>dfs.namenode.rpc-address.mycluster.nn1</name>
<value>machine1.example.com:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn2</name>
<value>machine2.example.com:8020</value>
</property>
There are quite a few more properties involved in configuring NameNode HA properly with a nameservice.
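For example, HDFS clients usually also need a failover proxy provider so they can discover which NameNode is currently active; a minimal sketch for the mycluster nameservice, assuming Hadoop's built-in ConfiguredFailoverProxyProvider:
<property>
<name>dfs.client.failover.proxy.provider.mycluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
<description>Class the HDFS client uses to contact the active NameNode</description>
</property>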
With this setup, the HDFS URL for a file will look like this:
hdfs://mycluster/file/location/in/hdfs/wo/namenode/host
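Optionally, fs.defaultFS in core-site.xml can also point at the nameservice, so that paths without an explicit scheme or authority resolve through it as well; a minimal sketch, assuming the same mycluster nameservice (this property goes in core-site.xml, not hdfs-site.xml):
<property>
<name>fs.defaultFS</name>
<value>hdfs://mycluster</value>
<description>Default filesystem, resolved through the nameservice</description>
</property>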
Applying the same properties in Java code:
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
Configuration conf = new Configuration(false);
conf.set("dfs.nameservices", "mycluster");
conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
conf.set("dfs.namenode.rpc-address.mycluster.nn1", "machine1.example.com:8020");
conf.set("dfs.namenode.rpc-address.mycluster.nn2", "machine2.example.com:8020");
// the failover proxy provider lets the client find the active NameNode
conf.set("dfs.client.failover.proxy.provider.mycluster",
        "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
// FileSystem.get() takes a URI and a Configuration, not a plain path string
FileSystem fsObj = FileSystem.get(URI.create("hdfs://mycluster"), conf);
// now use fsObj to perform HDFS shell-like operations, e.g.
boolean exists = fsObj.exists(new Path("relative/path/of/file/or/dir"));