Addressing issues with an Apache Spark application run in client mode from a Docker container

I'm trying to connect to a standalone Apache Spark cluster from a dockerized Apache Spark application using client mode.

The driver gives the Spark Master and the Workers its address. When run inside a Docker container, it will use some_docker_container_ip. That Docker address is not visible from outside the container network, so the application won't work.

Spark has a spark.driver.host property, which is passed to the Master and Workers. My initial instinct was to set it to the host machine's address so the cluster would contact the visible machine instead.

Unfortunately, spark.driver.host is also used by the driver to set up its own server. Setting it to the host machine's address causes server startup errors, because the Docker container cannot bind ports on the host machine's address.

It seems like a lose-lose situation: I can use neither the host machine's address nor the Docker container's address.

Ideally I would like to have two properties: a spark.driver.host-to-bind-to used to set up the driver's server, and a spark.driver.host-for-master that would be advertised to the Master and Workers (see the sketch below). Unfortunately it seems like I'm stuck with a single property.
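
A minimal sketch of what such a split might look like, purely for illustration: neither property name below exists in Spark (they are the hypothetical ones from this question), and the master URL and host IP are placeholders.

```scala
import org.apache.spark.SparkConf

// Hypothetical sketch only: neither spark.driver.host-to-bind-to nor
// spark.driver.host-for-master is a real Spark property. They illustrate
// the desired split between the bind address and the advertised address.
val conf = new SparkConf()
  .setAppName("client-mode-from-docker")
  .setMaster("spark://spark-master:7077")               // placeholder standalone master URL
  .set("spark.driver.host-to-bind-to", "0.0.0.0")       // what the driver's server would bind inside the container
  .set("spark.driver.host-for-master", "192.168.1.10")  // placeholder host IP advertised to the Master and Workers
```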

Another approach would be to use --net=host when running the Docker container. This approach has many disadvantages (e.g. other Docker containers cannot be linked to a container running with --net=host and must be exposed outside of the Docker network), and I would like to avoid it.

Is there any way I could solve the driver-addressing problem without exposing the Docker containers?

Asked Jul 22 '16 by Ajk

People also ask

Can you run Spark in Docker?

An Apache Spark cluster can easily be set up with the default docker-compose.yml file from the root of this repo. The docker-compose file includes two different services, spark-master and spark-worker. By default, when you deploy the docker-compose file you will get an Apache Spark cluster with 1 master and 1 worker.

Which is better client or cluster mode in Spark?

Cluster mode is used to run production jobs. In client mode, the driver runs locally on the machine from which you submit your application with the spark-submit command. Client mode is mainly used for interactive and debugging purposes.

How do you containerize a Spark application?

To containerize our app, we simply need to build it and push it to Docker Hub. You'll need to have Docker running and be logged into Docker Hub, as when we built the base image. You'll also need to be in the project directory (cd ~/containerized-app) to follow the rest of the steps.


1 Answer

This problem is fixed in https://github.com/apache/spark/pull/15120.

It will be part of the Apache Spark 2.1 release.
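
For reference, a minimal sketch of how the split looks once that change is available, assuming Spark 2.1+ where spark.driver.bindAddress lets the driver bind one address while advertising another; the master URL, host IP, and port numbers are placeholders, and the fixed driver/block-manager ports are assumed to be published from the container (e.g. with docker run -p).

```scala
import org.apache.spark.sql.SparkSession

// Sketch assuming Spark 2.1+: bind inside the container, advertise the host address.
// The master URL, host IP, and port numbers below are placeholders.
val spark = SparkSession.builder()
  .appName("client-mode-from-docker")
  .master("spark://spark-master:7077")            // standalone master URL (placeholder)
  .config("spark.driver.host", "192.168.1.10")    // host machine address advertised to the Master and Workers
  .config("spark.driver.bindAddress", "0.0.0.0")  // address the driver's server actually binds inside the container
  .config("spark.driver.port", "40000")           // fixed so it can be published from the container
  .config("spark.blockManager.port", "40001")     // fixed so it can be published from the container
  .getOrCreate()
```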

Answered Sep 23 '22 by Ajk