
Docker Container with Apache Spark in standalone cluster mode

I am trying to construct a Docker image containing Apache Spark. It is built upon the openjdk-8-jre official image.

The goal is to execute Spark in cluster mode, thus having at least one master (started via sbin/start-master.sh) and one or more slaves (sbin/start-slave.sh). See spark-standalone-docker for my Dockerfile and entrypoint script.

The build itself goes through; the problem is that when I run the container, it starts and then stops shortly after. The cause is that the Spark master launch script starts the master in daemon mode and exits. The container then terminates, because no process is running in the foreground anymore.

The obvious solution is to run the Spark master process in the foreground, but I could not figure out how (Google did not turn up anything either). My "workaround-solution" is to run tail -f on the Spark log directory.

Thus, my questions are:

  1. How can you run Apache Spark Master in foreground?
  2. If the first is not possible / feasible / whatever, what is the preferred (i.e. best practice) solution to keeping a container "alive" (I really don't want to use an infinite loop and a sleep command)?
akoeltringer asked Sep 23 '16 23:09



1 Answer

UPDATED ANSWER (for Spark 2.4.0):

To start the Spark master in the foreground, set the environment variable SPARK_NO_DAEMONIZE=true before running ./start-master.sh, and you are good to go.
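As a rough illustration, a container entrypoint could wrap this in a small helper. This is only a sketch: the function name start_master_foreground is made up for the example, and it assumes SPARK_HOME points at the Spark installation inside the image.

```shell
#!/bin/sh
# Sketch of an entrypoint helper (hypothetical name, not part of Spark).
# Assumes SPARK_HOME points at the Spark installation inside the image.

start_master_foreground() {
    # Export the variable so the launch scripts started below inherit it.
    export SPARK_NO_DAEMONIZE=true
    # exec replaces this shell with the launch script, which then blocks
    # for as long as the master runs instead of returning immediately.
    # The :? expansion aborts with a message if SPARK_HOME is not set.
    exec "${SPARK_HOME:?SPARK_HOME must be set}/sbin/start-master.sh" "$@"
}
```

An entrypoint script would then end with a call such as start_master_foreground "$@", so the container stays up as long as the master process does.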

For more info, check $SPARK_HOME/sbin/spark-daemon.sh:

# Runs a Spark command as a daemon.
#
# Environment Variables
#
#   SPARK_CONF_DIR  Alternate conf dir. Default is ${SPARK_HOME}/conf.
#   SPARK_LOG_DIR   Where log files are stored. ${SPARK_HOME}/logs by default.
#   SPARK_MASTER    host:path where spark code should be rsync'd from
#   SPARK_PID_DIR   The pid files are stored. /tmp by default.
#   SPARK_IDENT_STRING   A string representing this instance of spark. $USER by default
#   SPARK_NICENESS The scheduling priority for daemons. Defaults to 0.
#   SPARK_NO_DAEMONIZE   If set, will run the proposed command in the foreground. It will not output a PID file.
##
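Note the wording "If set" in the last entry. Assuming the script only tests whether the variable is set and ignores its value (which is what the comment suggests), even SPARK_NO_DAEMONIZE=false would select foreground mode. The sketch below shows that kind of test in plain shell; is_foreground_requested is a hypothetical helper for illustration, not a function from Spark.

```shell
#!/bin/sh
# Hypothetical helper, not part of Spark: succeeds when the variable is set at all.
is_foreground_requested() {
    # ${VAR+set} expands to "set" when VAR is set (even to an empty string)
    # and to nothing when VAR is unset.
    [ -n "${SPARK_NO_DAEMONIZE+set}" ]
}

unset SPARK_NO_DAEMONIZE
is_foreground_requested && echo "foreground" || echo "daemon"   # prints "daemon"

SPARK_NO_DAEMONIZE=false
is_foreground_requested && echo "foreground" || echo "daemon"   # prints "foreground"
```

Under that assumption, the way to get daemon mode back is to unset the variable, not to assign it a "false" value.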
dsncode answered Oct 18 '22 14:10
