I'm trying to connect to a standalone Apache Spark cluster from a dockerized Apache Spark application using client mode.
The driver gives the Spark master and the workers its address. When run inside a docker container it will use some_docker_container_ip. That docker address is not visible from outside, so the application won't work.
Spark has a spark.driver.host property, which is passed to the master and the workers. My initial instinct was to pass the host machine's address there so the cluster would address a visible machine instead.
Unfortunately, spark.driver.host is also used by the driver to set up its own server. Passing the host machine's address there causes server startup errors, because a docker container cannot bind ports under the host machine's address.
It seems like a lose-lose situation: I can use neither the host machine address nor the docker container address.
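To make the dilemma concrete, here is a minimal sketch of the client-mode setup I mean; the master URL and both IP addresses are placeholders I chose for illustration, not values from my actual environment:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("dockerized-app")
  .setMaster("spark://spark-master:7077")   // standalone cluster (placeholder URL)
  // Option A: the container's own IP -- the driver server binds fine,
  // but the master and workers cannot reach this address from outside docker.
  .set("spark.driver.host", "172.17.0.2")
  // Option B: the host machine's IP -- reachable by the cluster,
  // but the driver cannot bind a server socket to an address the container
  // does not own, so startup fails.
  //.set("spark.driver.host", "192.168.1.10")

val sc = new SparkContext(conf)

Either choice breaks one side of the connection, which is exactly the problem described above.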
Ideally I would like to have two properties: a spark.driver.host-to-bind-to used to set up the driver server, and a spark.driver.host-for-master which would be used by the master and workers. Unfortunately, it seems I'm stuck with only one property.
Another approach would be to use --net=host when running the docker container. This approach has many disadvantages (e.g. other docker containers cannot be linked to a container running with --net=host and must be exposed outside of the docker network), and I would like to avoid it.
Is there any way I could solve the driver-addressing problem without exposing the docker containers?
This problem is fixed in https://github.com/apache/spark/pull/15120 and will be part of the Apache Spark 2.1 release.
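As I understand the linked pull request, it lets the driver advertise a different address than the one it binds to via a separate spark.driver.bindAddress setting. A hedged sketch of how that could look once 2.1 is available; the host IP 192.168.1.10 and ports 5001/5002 are placeholders, and the driver ports would have to be published by docker (e.g. mapped one-to-one from host to container):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("dockerized-app")
  .setMaster("spark://spark-master:7077")
  // Address advertised to the master and workers (reachable from outside docker).
  .set("spark.driver.host", "192.168.1.10")
  // Address the driver actually binds its server to (inside the container).
  .set("spark.driver.bindAddress", "0.0.0.0")
  // Fixed ports so the docker port mappings can be declared up front.
  .set("spark.driver.port", "5001")
  .set("spark.blockManager.port", "5002")

val sc = new SparkContext(conf)

The cluster then connects back to 192.168.1.10:5001, which docker forwards into the container, while the driver binds locally without trying to claim the host machine's address.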