How to access HDFS via a URI with H/A NameNodes from Spark running outside the Hadoop cluster?

I have some Spark applications that store their output to HDFS.

Since our Hadoop cluster uses NameNode H/A and the Spark cluster sits outside the Hadoop cluster (I know this is not ideal), I need to specify an HDFS URI to the application so that it can access HDFS.

But the application doesn't recognize the nameservice, so I can only give it one NameNode's URI, and if that NameNode fails, I have to modify the configuration file and try again.

Querying ZooKeeper to discover the active NameNode seems very annoying, so I'd like to avoid it.

Could you suggest any alternatives?

Asked Jun 12 '15 by Jungtaek Lim


People also ask

How do I access HDFS from Spark?

Accessing HDFS from PySpark:

export HADOOP_CONF_DIR=/etc/hadoop/conf
[hrt_qa@ip-172-31-42-188 spark]$ pyspark
>>> lines = sc.textFile("hdfs://ip-172-31-42-188.ec2.internal:8020/tmp/PySparkTest/file-01")
.......

How do multiple NameNodes ensure high availability of data in HDFS?

In the HDFS NameNode High Availability architecture, two NameNodes run at the same time. We can implement the Active and Standby NameNode configuration in the following two ways: using Quorum Journal Nodes, or using shared storage.

What is HDFS URI?

The URI format is scheme://authority/path. For HDFS the scheme is hdfs, and for the Local FS the scheme is file. The scheme and authority are optional. If not specified, the default scheme specified in the configuration is used.
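
As a hedged illustration of these two forms in PySpark (the NameNode host, port, and paths below are placeholders, not taken from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("uri-forms").getOrCreate()

# Fully qualified URI: scheme 'hdfs', authority 'namenode.example.com:8020', path '/tmp/input.txt'
explicit = spark.sparkContext.textFile("hdfs://namenode.example.com:8020/tmp/input.txt")

# Scheme and authority omitted: the default filesystem (fs.defaultFS) from the Hadoop configuration is used
implicit = spark.sparkContext.textFile("/tmp/input.txt")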


2 Answers

Suppose your nameservice is 'hadooptest'; then set the Hadoop configuration as shown below. You can get this information from the hdfs-site.xml file of the remote HA-enabled HDFS cluster.

sc.hadoopConfiguration.set("dfs.nameservices", "hadooptest")
sc.hadoopConfiguration.set("dfs.client.failover.proxy.provider.hadooptest", "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")
sc.hadoopConfiguration.set("dfs.ha.namenodes.hadooptest", "nn1,nn2")
sc.hadoopConfiguration.set("dfs.namenode.rpc-address.hadooptest.nn1", "10.10.14.81:8020")
sc.hadoopConfiguration.set("dfs.namenode.rpc-address.hadooptest.nn2", "10.10.14.82:8020")

After this, you can use a URL with 'hadooptest' as the authority, like below.

test.write.orc("hdfs://hadooptest/tmp/test/r1")

Check here for more information.
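
For PySpark applications, a similar effect can be achieved by forwarding the same keys through the spark.hadoop.* prefix when building the session. This is only a minimal sketch, assuming the same 'hadooptest' nameservice and NameNode addresses as in the settings above:

from pyspark.sql import SparkSession

# Sketch: the spark.hadoop.* prefix forwards these keys into the Hadoop configuration.
spark = (SparkSession.builder
    .appName("ha-hdfs-example")
    .config("spark.hadoop.dfs.nameservices", "hadooptest")
    .config("spark.hadoop.dfs.ha.namenodes.hadooptest", "nn1,nn2")
    .config("spark.hadoop.dfs.namenode.rpc-address.hadooptest.nn1", "10.10.14.81:8020")
    .config("spark.hadoop.dfs.namenode.rpc-address.hadooptest.nn2", "10.10.14.82:8020")
    .config("spark.hadoop.dfs.client.failover.proxy.provider.hadooptest",
            "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")
    .getOrCreate())

# The logical nameservice can now be used instead of a concrete NameNode host.
spark.read.orc("hdfs://hadooptest/tmp/test/r1")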

Answered Oct 07 '22 by Mungeol Heo


If you want to make the H/A HDFS cluster your default configuration (usually the case), so that it applies to every application started through spark-submit or spark-shell, you can write the cluster information into spark-defaults.conf.

sudo vim $SPARK_HOME/conf/spark-defaults.conf

And add the following lines, assuming your HDFS cluster name is hdfs-k8s:

spark.hadoop.dfs.nameservices   hdfs-k8s
spark.hadoop.dfs.ha.namenodes.hdfs-k8s  nn0,nn1
spark.hadoop.dfs.namenode.rpc-address.hdfs-k8s.nn0 192.168.23.55:8020
spark.hadoop.dfs.namenode.rpc-address.hdfs-k8s.nn1 192.168.23.56:8020
spark.hadoop.dfs.client.failover.proxy.provider.hdfs-k8s    org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider

It should work when your next application is launched.

sc.addPyFile('hdfs://hdfs-k8s/user/root/env.zip')
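
As a quick check (the path below is just an illustrative placeholder), any later PySpark session should then resolve the nameservice without any per-application settings:

from pyspark.sql import SparkSession

# With the spark.hadoop.* keys above in spark-defaults.conf, no extra configuration is
# needed here; 'hdfs-k8s' resolves to whichever NameNode is currently active.
spark = SparkSession.builder.appName("uses-default-ha-config").getOrCreate()
df = spark.read.text("hdfs://hdfs-k8s/user/root/some-input.txt")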
Answered Oct 07 '22 by ttimasdf