I am trying to construct a docker image containing Apache Spark. IT is built upon the openjdk-8-jre official image.
The goal is to execute Spark in cluster mode, thus having at least one master (started via sbin/start-master.sh
) and one or more slaves (sbin/start-slave.sh
). See spark-standalone-docker for my Dockerfile and entrypoint script.
The build itself actually goes through, the problem is that when I want to run the container, it starts and stops shortly after. The cause is that Spark master launch script starts the master in daemon mode and exits. Thus the container terminates, as there is no process running in the foreground anymore.
The obvious solution is to run the Spark master process in foreground, but I could not figure out how (Google did not turn up anything either). My "workaround-solution" is to run tails -f
on the Spark log directory.
Thus, my questions are:
UPDATED ANSWER (for spark 2.4.0):
To start spark master on foreground, just set the ENV variable SPARK_NO_DAEMONIZE=true on your environment before running ./start-master.sh
and you are good to go.
for more info, check $SPARK_HOME/sbin/spark-daemon.sh
# Runs a Spark command as a daemon.
#
# Environment Variables
#
# SPARK_CONF_DIR Alternate conf dir. Default is ${SPARK_HOME}/conf.
# SPARK_LOG_DIR Where log files are stored. ${SPARK_HOME}/logs by default.
# SPARK_MASTER host:path where spark code should be rsync'd from
# SPARK_PID_DIR The pid files are stored. /tmp by default.
# SPARK_IDENT_STRING A string representing this instance of spark. $USER by default
# SPARK_NICENESS The scheduling priority for daemons. Defaults to 0.
# SPARK_NO_DAEMONIZE If set, will run the proposed command in the foreground. It will not output a PID file.
##
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With