I'm trying to connect to a standalone Apache Spark cluster from a dockerized Apache Spark application using client mode.
The driver gives the Spark master and the workers its address. When run inside a docker container it will use some_docker_container_ip. That docker address is not visible from outside, so the application won't work.
Spark has a spark.driver.host property, which is passed to the master and the workers. My initial instinct was to pass the host machine's address there so the cluster would address a visible machine instead.
Unfortunately, spark.driver.host is also used by the driver to set up its own server. Passing the host machine's address there causes server startup errors, because a docker container cannot bind ports under the host machine's address.
It seems like a lose-lose situation: I can use neither the host machine address nor the docker container address.
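To make the dilemma concrete, here is a minimal sketch of the client-mode setup I mean; the master URL and both IP addresses are placeholders I chose for illustration, not values from my actual environment:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("dockerized-app")
  .setMaster("spark://spark-master:7077")   // standalone cluster (placeholder URL)
  // Option A: the container's own IP -- the driver server binds fine,
  // but the master and workers cannot reach this address from outside docker.
  .set("spark.driver.host", "172.17.0.2")
  // Option B: the host machine's IP -- reachable by the cluster,
  // but the driver cannot bind a server socket to an address the container
  // does not own, so startup fails.
  //.set("spark.driver.host", "192.168.1.10")

val sc = new SparkContext(conf)

Either choice breaks one side of the connection, which is exactly the problem described above.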
Ideally I would like to have two properties: a spark.driver.host-to-bind-to used to set up the driver server, and a spark.driver.host-for-master which would be used by the master and workers. Unfortunately, it seems I'm stuck with only one property.
Another approach would be to use --net=host when running the docker container. This approach has many disadvantages (e.g. other docker containers cannot be linked to a container running with --net=host and must be exposed outside of the docker network), and I would like to avoid it.
Is there any way I could solve the driver-addressing problem without exposing the docker containers?
This problem is fixed in https://github.com/apache/spark/pull/15120 and will be part of the Apache Spark 2.1 release.
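As I understand the linked pull request, it lets the driver advertise a different address than the one it binds to via a separate spark.driver.bindAddress setting. A hedged sketch of how that could look once 2.1 is available; the host IP 192.168.1.10 and ports 5001/5002 are placeholders, and the driver ports would have to be published by docker (e.g. mapped one-to-one from host to container):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("dockerized-app")
  .setMaster("spark://spark-master:7077")
  // Address advertised to the master and workers (reachable from outside docker).
  .set("spark.driver.host", "192.168.1.10")
  // Address the driver actually binds its server to (inside the container).
  .set("spark.driver.bindAddress", "0.0.0.0")
  // Fixed ports so the docker port mappings can be declared up front.
  .set("spark.driver.port", "5001")
  .set("spark.blockManager.port", "5002")

val sc = new SparkContext(conf)

The cluster then connects back to 192.168.1.10:5001, which docker forwards into the container, while the driver binds locally without trying to claim the host machine's address.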