I have some Spark applications that store their output to HDFS.
Since our Hadoop cluster uses NameNode H/A and the Spark cluster sits outside the Hadoop cluster (I know that's not ideal), I need to give the application an HDFS URI so that it can access HDFS.
But it doesn't recognize the nameservice, so I can only pass the URI of one of the NameNodes, and if that one fails I have to modify the configuration file and try again.
Querying ZooKeeper to find the active NameNode seems very cumbersome, so I'd like to avoid it.
Could you suggest any alternatives?
Accessing HDFS from PySpark:
export HADOOP_CONF_DIR=/etc/hadoop/conf
[hrt_qa@ip-172-31-42-188 spark]$ pyspark
>>> lines = sc.textFile("hdfs://ip-172-31-42-188.ec2.internal:8020/tmp/PySparkTest/file-01")
...
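Setting HADOOP_CONF_DIR works because the HDFS client resolves a logical nameservice URI from the hdfs-site.xml it finds in that directory. Since your Spark cluster is outside the Hadoop cluster, one option is to copy the client configuration files over and point HADOOP_CONF_DIR at them. A rough sketch, where the host name and paths are placeholders:

mkdir -p /opt/hadoop-client-conf
# Copy the client configs from one of the Hadoop nodes (placeholder host).
scp hadoop-node.example.com:/etc/hadoop/conf/core-site.xml /opt/hadoop-client-conf/
scp hadoop-node.example.com:/etc/hadoop/conf/hdfs-site.xml /opt/hadoop-client-conf/
export HADOOP_CONF_DIR=/opt/hadoop-client-conf
pyspark

With the nameservice definitions available, hdfs://<nameservice>/path URIs resolve and the client fails over between NameNodes on its own.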
In the HDFS NameNode High Availability architecture, two NameNodes run at the same time. The Active/Standby NameNode configuration can be implemented in two ways: using Quorum Journal Nodes, or using shared storage.
The URI format is scheme://authority/path. For HDFS the scheme is hdfs, and for the Local FS the scheme is file. The scheme and authority are optional. If not specified, the default scheme specified in the configuration is used.
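For example (the host name and paths here are only illustrative):

hdfs://namenode.example.com:8020/tmp/PySparkTest/file-01   (fully qualified HDFS URI)
file:///tmp/PySparkTest/file-01                            (local filesystem)
/tmp/PySparkTest/file-01                                   (no scheme; resolved against the configured default filesystem, fs.defaultFS)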
Suppose your nameservice is 'hadooptest'; then set the Hadoop configuration as shown below. You can get this information from the hdfs-site.xml file of the remote HA-enabled HDFS cluster.
sc.hadoopConfiguration.set("dfs.nameservices", "hadooptest")
sc.hadoopConfiguration.set("dfs.client.failover.proxy.provider.hadooptest", "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")
sc.hadoopConfiguration.set("dfs.ha.namenodes.hadooptest", "nn1,nn2")
sc.hadoopConfiguration.set("dfs.namenode.rpc-address.hadooptest.nn1", "10.10.14.81:8020")
sc.hadoopConfiguration.set("dfs.namenode.rpc-address.hadooptest.nn2", "10.10.14.82:8020")
After this, you can use URLs with the 'hadooptest' nameservice, like below.
test.write.orc("hdfs://hadooptest/tmp/test/r1")
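Putting the pieces together, here is a minimal, self-contained Scala sketch of the same approach; the nameservice name, NameNode addresses, and output path are the placeholder values from above, so substitute your own.

import org.apache.spark.sql.SparkSession

object HaHdfsWrite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ha-hdfs-write").getOrCreate()

    // Describe the remote HA nameservice to the Hadoop client (values are placeholders).
    val hc = spark.sparkContext.hadoopConfiguration
    hc.set("dfs.nameservices", "hadooptest")
    hc.set("dfs.ha.namenodes.hadooptest", "nn1,nn2")
    hc.set("dfs.namenode.rpc-address.hadooptest.nn1", "10.10.14.81:8020")
    hc.set("dfs.namenode.rpc-address.hadooptest.nn2", "10.10.14.82:8020")
    hc.set("dfs.client.failover.proxy.provider.hadooptest",
      "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")

    // Write through the logical nameservice; the client finds the active NameNode.
    val test = spark.range(10).toDF("id")
    test.write.mode("overwrite").orc("hdfs://hadooptest/tmp/test/r1")

    spark.stop()
  }
}

Because the failover proxy provider is configured, the client retries against the other NameNode if the first one it contacts is in standby.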
If you want to make an H/A HDFS cluster your default configuration (usually the case), so that it applies to every application started through spark-submit or spark-shell, you can write the cluster information into spark-defaults.conf.
sudo vim $SPARK_HOME/conf/spark-defaults.conf
And add the following lines, assuming your HDFS cluster name is hdfs-k8s:
spark.hadoop.dfs.nameservices hdfs-k8s
spark.hadoop.dfs.ha.namenodes.hdfs-k8s nn0,nn1
spark.hadoop.dfs.namenode.rpc-address.hdfs-k8s.nn0 192.168.23.55:8020
spark.hadoop.dfs.namenode.rpc-address.hdfs-k8s.nn1 192.168.23.56:8020
spark.hadoop.dfs.client.failover.proxy.provider.hdfs-k8s org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
It should work when your next application is launched:
sc.addPyFile('hdfs://hdfs-k8s/user/root/env.zip')
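If you'd rather not touch spark-defaults.conf, the same spark.hadoop.* keys can also be passed per application on the spark-submit command line. A sketch with the same placeholder addresses (my_app.py stands in for your application):

spark-submit \
  --conf spark.hadoop.dfs.nameservices=hdfs-k8s \
  --conf spark.hadoop.dfs.ha.namenodes.hdfs-k8s=nn0,nn1 \
  --conf spark.hadoop.dfs.namenode.rpc-address.hdfs-k8s.nn0=192.168.23.55:8020 \
  --conf spark.hadoop.dfs.namenode.rpc-address.hdfs-k8s.nn1=192.168.23.56:8020 \
  --conf spark.hadoop.dfs.client.failover.proxy.provider.hdfs-k8s=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider \
  my_app.py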