We have a Spark Streaming application, which is a long-running task. The event log points to the HDFS location hdfs://spark-history; an application_XXX.inprogress file is created there when we start the streaming application, and the file grows to as much as 70GB. To delete the log file we currently stop the Spark Streaming application and clear it manually. Is there any way to automate this process without stopping or restarting the application? We have configured spark.history.fs.cleaner.enabled=true with a cleaning interval of 1 day and a max age of 2 days; however, it is not cleaning the .inprogress file. We are using Spark 1.6.2, running on YARN and deployed in cluster mode.
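For reference, in spark-defaults.conf terms that cleaner configuration would look like this (a sketch using the standard property names for these settings):

spark.history.fs.cleaner.enabled true
spark.history.fs.cleaner.interval 1d
spark.history.fs.cleaner.maxAge 2d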
One workaround: before stopping the SparkContext, call Thread.sleep(86400000). This keeps your Spark UI alive for 24 hours, until you kill the process.
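A minimal sketch of that idea, assuming a Scala driver where sc is your SparkContext (the names are illustrative):

// Keep the driver, and therefore its Spark UI, alive for 24 hours before shutting down
Thread.sleep(86400000L) // 86400000 ms = 24 h
sc.stop()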
You can stop a running instance of the History Server using the $SPARK_HOME/sbin/stop-history-server.sh shell script.
spark.eventLog.dir defaults to file:///tmp/spark-events. You need to create the directory in advance. Spark keeps a history of every application you run by creating a sub-directory for each application and logging the application-specific events there.
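For example, to point the event log at HDFS instead (a sketch; the path is illustrative), you could set this in spark-defaults.conf and create the directory beforehand:

spark.eventLog.enabled true
spark.eventLog.dir hdfs:///spark-history

# create the directory in advance
hdfs dfs -mkdir -p /spark-history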
To fix this you have to change a few configurations. Add or change the following row in yarn-site.xml (yarn-default.xml only ships the defaults):
yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds=3600
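In XML form, as it would appear in the file, the same row looks like this (note it only takes effect when log aggregation itself is enabled via yarn.log-aggregation-enable):

<property>
  <name>yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds</name>
  <value>3600</value>
</property>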
This change makes YARN aggregate the logs of a running application on a rolling basis, so you can read them with yarn logs -applicationId YOUR_APP_ID.
This is the first step. You can read a little more about it here.
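For example (the application id below is hypothetical):

yarn logs -applicationId application_1462793131757_0001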
Second step: create a log4j-driver.properties file and a log4j-executor.properties file.
In these files you can use something like this example:
log4j.rootLogger=INFO, rolling
# Roll the log file instead of letting it grow without bound
log4j.appender.rolling=org.apache.log4j.RollingFileAppender
log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling.layout.conversionPattern=[%d] %p %m (%c)%n
log4j.appender.rolling.maxFileSize=50MB
log4j.appender.rolling.maxBackupIndex=5
# ${dm.logging.name} is filled in at submit time via -Ddm.logging.name=...
log4j.appender.rolling.file=/var/log/spark/${dm.logging.name}.log
log4j.appender.rolling.encoding=UTF-8
log4j.logger.org.apache.spark=WARN
log4j.logger.org.eclipse.jetty=WARN
# ${dm.logging.level} is filled in at submit time via -Ddm.logging.level=...
log4j.logger.com.anjuke.dm=${dm.logging.level}
What do these rows mean?
This one, log4j.appender.rolling.maxFileSize=50MB, caps each log file at 50MB: when a file reaches 50MB it is closed and a new one is started.
The other relevant row is log4j.appender.rolling.maxBackupIndex=5, which means you keep a backup history of five 50MB files; as new files are created, the oldest backups are deleted. In total the appender therefore never uses more than about 300MB of disk (the active file plus five backups).
After you create these log files, you need to pass them via the spark-submit command:
spark-submit \
  --master spark://127.0.0.1:7077 \
  --driver-java-options "-Dlog4j.configuration=file:/path/to/log4j-driver.properties -Ddm.logging.level=DEBUG" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/path/to/log4j-executor.properties -Ddm.logging.name=myapp -Ddm.logging.level=DEBUG" \
  ...
You can create separate log files for your driver and your executors. In the command above I am using two different files, but you can use the same one for both. For more details you can see here.
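One caveat, since you run on YARN in cluster mode: a file:/path/... reference only works if the file exists at that path on every node. A common alternative, sketched below rather than taken from the answer above, is to ship the files with --files and reference them by plain name:

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --files /path/to/log4j-driver.properties,/path/to/log4j-executor.properties \
  --driver-java-options "-Dlog4j.configuration=log4j-driver.properties -Ddm.logging.level=DEBUG" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j-executor.properties -Ddm.logging.name=myapp -Ddm.logging.level=DEBUG" \
  ...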