 

Apache spark: setting spark.eventLog.enabled and spark.eventLog.dir at submit or Spark start

Tags:

apache-spark

I would like to set spark.eventLog.enabled and spark.eventLog.dir at the spark-submit or start-all level, without requiring them to be enabled in the Scala/Java/Python code itself. I have tried various things with no success:

Setting spark-defaults.conf as

spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://namenode:8021/directory

or

spark.eventLog.enabled           true
spark.eventLog.dir               file:///some/where

Running spark-submit as:

spark-submit --conf "spark.eventLog.enabled=true" --conf "spark.eventLog.dir=file:///tmp/test" --master spark://server:7077 examples/src/main/python/pi.py

Starting spark with environment variables:

SPARK_DAEMON_JAVA_OPTS="-Dspark.eventLog.enabled=true -Dspark.history.fs.logDirectory=$sparkHistoryDir -Dspark.history.provider=org.apache.spark.deploy.history.FsHistoryProvider -Dspark.history.fs.cleaner.enabled=true -Dspark.history.fs.cleaner.interval=2d"

and just for overkill:

SPARK_HISTORY_OPTS="-Dspark.eventLog.enabled=true -Dspark.history.fs.logDirectory=$sparkHistoryDir -Dspark.history.provider=org.apache.spark.deploy.history.FsHistoryProvider -Dspark.history.fs.cleaner.enabled=true -Dspark.history.fs.cleaner.interval=2d"
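For contrast, the in-code approach I am trying to avoid looks roughly like this (a minimal PySpark sketch; the app name and HDFS path are placeholders, not part of any working setup):

from pyspark import SparkConf, SparkContext

# Placeholder app name and event-log path; the point is only that the
# properties are set on the SparkConf before the SparkContext is created.
conf = (SparkConf()
        .setAppName("event-log-example")
        .set("spark.eventLog.enabled", "true")
        .set("spark.eventLog.dir", "hdfs://namenode:8020/eventLogging"))
sc = SparkContext(conf=conf)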

Where and how must these things be set to get history on arbitrary jobs?

asked Jul 05 '15 by SpmP


People also ask

How are we monitoring batch and checking logs in spark?

Spark keeps a history of every application you run by creating a sub-directory for each application and logging the events specific to that application in this directory. You can also set the location to an HDFS directory so the history files can be read by the history server.
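For example, a common setup (a sketch only; the host and path are placeholders) points both the applications and the history server at the same directory in spark-defaults.conf, and then starts the history server from Spark's sbin directory:

spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://namenode:8020/spark-logs
spark.history.fs.logDirectory    hdfs://namenode:8020/spark-logs

./sbin/start-history-server.sh

The history server web UI then serves the logged applications, on port 18080 by default.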

What is spark submit?

The spark-submit script in Spark's bin directory is used to launch applications on a cluster. It can use all of Spark's supported cluster managers through a uniform interface, so you don't have to configure your application specifically for each one.

How do I check my spark status?

You can view the status of a Spark Application that is created for the notebook in the status widget on the notebook panel. The widget also displays links to the Spark UI, Driver Logs, and Kernel Log. Additionally, you can view the progress of the Spark job when you run the code.
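Outside a notebook, a running application also exposes a monitoring REST API on the driver's web UI port (4040 by default), so status can be checked from the command line; the hostname below is a placeholder:

curl http://<driver-host>:4040/api/v1/applications

The history server exposes the same API, on port 18080 by default, for completed applications.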


1 Answer

I solved the problem, although strangely I had tried this before. In any case, it now seems like a stable solution:

Create a directory in HDFS for logging, say /eventLogging

hdfs dfs -mkdir /eventLogging

Then spark-shell or spark-submit (or whatever) can be run with the following options:

--conf spark.eventLog.enabled=true --conf spark.eventLog.dir=hdfs://<hdfsNameNodeAddress>:8020/eventLogging

such as:

spark-shell --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=hdfs://<hdfsNameNodeAddress>:8020/eventLogging
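To verify that events are actually being written, list the directory after running a job; if a history server is configured with spark.history.fs.logDirectory pointing at the same path (as described above), the application will then show up in its UI:

hdfs dfs -ls /eventLogging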
answered Oct 08 '22 by SpmP