I would like to set spark.eventLog.enabled
and spark.eventLog.dir
at the spark-submit
or start-all
level -- rather than requiring them to be enabled in the Scala/Java/Python code.
I have tried various things with no success:
In spark-defaults.conf, as:
spark.eventLog.enabled true
spark.eventLog.dir hdfs://namenode:8021/directory
or:
spark.eventLog.enabled true
spark.eventLog.dir file:///some/where
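(In case the location matters: spark-defaults.conf is read from $SPARK_HOME/conf, or from $SPARK_CONF_DIR if that is set, on the machine the job is launched from; if only the shipped template exists, it can be copied first, roughly:)
# create spark-defaults.conf from the template, then append the two settings shown above
cp $SPARK_HOME/conf/spark-defaults.conf.template $SPARK_HOME/conf/spark-defaults.conf
echo "spark.eventLog.enabled true" >> $SPARK_HOME/conf/spark-defaults.conf
echo "spark.eventLog.dir hdfs://namenode:8021/directory" >> $SPARK_HOME/conf/spark-defaults.conf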
With spark-submit, as:
spark-submit --conf "spark.eventLog.enabled=true" --conf "spark.eventLog.dir=file:///tmp/test" --master spark://server:7077 examples/src/main/python/pi.py
SPARK_DAEMON_JAVA_OPTS="-Dspark.eventLog.enabled=true -Dspark.history.fs.logDirectory=$sparkHistoryDir -Dspark.history.provider=org.apache.spark.deploy.history.FsHistoryProvider -Dspark.history.fs.cleaner.enabled=true -Dspark.history.fs.cleaner.interval=2d"
and just for overkill:
SPARK_HISTORY_OPTS="-Dspark.eventLog.enabled=true -Dspark.history.fs.logDirectory=$sparkHistoryDir -Dspark.history.provider=org.apache.spark.deploy.history.FsHistoryProvider -Dspark.history.fs.cleaner.enabled=true -Dspark.history.fs.cleaner.interval=2d"
Where and how must these things be set to get history on arbitrary jobs?
Spark keeps a history of every application you run by creating a sub-directory for each application and logging the application-specific events there. You can also set the location to a shared path such as an HDFS directory so the history files can be read by the history server.
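For example, a minimal sketch of pointing the history server at that same directory and starting it with the bundled script (the address and path are placeholders):
# in conf/spark-env.sh on the host that should run the history server
export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs://namenode:8021/directory"
# start it; the web UI listens on port 18080 by default
$SPARK_HOME/sbin/start-history-server.sh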
The spark-submit script in Spark's bin directory is used to launch applications on a cluster. It can use all of Spark's supported cluster managers through a uniform interface so you don't have to configure your application especially for each one.
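For instance, the same example application can be handed to different cluster managers just by changing --master (the hosts below are placeholders):
# local mode, standalone cluster, and YARN: only the master (and deploy mode) changes
spark-submit --master local[4] examples/src/main/python/pi.py
spark-submit --master spark://server:7077 examples/src/main/python/pi.py
spark-submit --master yarn --deploy-mode cluster examples/src/main/python/pi.py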
You can view the status of a Spark Application that is created for the notebook in the status widget on the notebook panel. The widget also displays links to the Spark UI, Driver Logs, and Kernel Log. Additionally, you can view the progress of the Spark job when you run the code.
I solved the problem, yet strangely I had tried this before... All the same, now it seems like a stable solution:
Create a directory in HDFS for logging, say /eventLogging:
hdfs dfs -mkdir /eventLogging
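If jobs will be submitted by a different HDFS user than the one that created the directory, that user also needs write access to it; one deliberately permissive example (adjust to your own policy):
hdfs dfs -chmod 777 /eventLogging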
Then spark-shell
or spark-submit
(or whatever) can be run with the following options:
--conf spark.eventLog.enabled=true --conf spark.eventLog.dir=hdfs://<hdfsNameNodeAddress>:8020/eventLogging
such as:
spark-shell --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=hdfs://<hdfsNameNodeAddress>:8020/eventLogging
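To verify, list the directory after a job finishes; each completed application should show up as an entry there, and a history server pointed at the same path (as sketched earlier) will show it in its web UI:
hdfs dfs -ls hdfs://<hdfsNameNodeAddress>:8020/eventLogging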