How to clean Spark history event log without stopping Spark streaming

We have a Spark Streaming application that runs as a long-lived task. Its event log is written to the HDFS location hdfs://spark-history; an application_XXX.inprogress file is created there when the streaming application starts, and the file grows to around 70GB. To delete the log file we currently stop the Spark streaming application and clear it. Is there any way to automate this cleanup without stopping or restarting the application? We have configured spark.history.fs.cleaner.enabled=true with a cleaning interval of 1 day and a max age of 2 days, but it does not clean up the .inprogress file. We are using Spark 1.6.2, running on YARN in cluster mode.
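For reference, the setup described above would look roughly like this in spark-defaults.conf (a sketch only; the exact log directory and values are assumptions based on the question, not from the original post):

# spark-defaults.conf -- sketch of the configuration described in the question
spark.eventLog.enabled              true
spark.eventLog.dir                  hdfs://spark-history
spark.history.fs.logDirectory       hdfs://spark-history
spark.history.fs.cleaner.enabled    true
spark.history.fs.cleaner.interval   1d
spark.history.fs.cleaner.maxAge     2d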

asked Mar 14 '17 by Vamshi Mothe


People also ask

How do I keep the Spark session alive?

Before stopping the SparkContext, call Thread.sleep(86400000). This keeps your Spark UI active for 24 hours, until you kill the process.
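As a minimal Scala sketch (assuming sc is your existing SparkContext):

// Keep the application, and therefore its UI, alive for 24 hours before stopping.
Thread.sleep(86400000L)  // 24 hours in milliseconds
sc.stop()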

How do I stop Spark History server?

You can stop a running instance of HistoryServer using the $SPARK_HOME/sbin/stop-history-server.sh shell script.
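For example (assuming $SPARK_HOME points at your Spark installation):

$SPARK_HOME/sbin/stop-history-server.sh     # stop the running History Server
$SPARK_HOME/sbin/start-history-server.sh    # start it again when needed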

How do I check my Spark history?

The location is controlled by spark.eventLog.dir; by default it is file:///tmp/spark-events, and you need to create the directory in advance. Spark keeps a history of every application you run by creating a sub-directory for each application and logging the events specific to that application in this directory.
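In other words, before the first run you create the directory, for example:

mkdir -p /tmp/spark-events           # default local location (file:///tmp/spark-events)
hdfs dfs -mkdir -p /spark-history    # or an HDFS location, as in the question above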


1 Answer

For this issue you have to change a few configurations. In your yarn-default.xml you need to change (or add) this row:

yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds=3600

This modification makes YARN roll up and aggregate your log files periodically, which allows you to see the data via yarn logs -applicationId YOUR_APP_ID.
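Since yarn-default.xml is an XML file, the same setting expressed in XML form would look roughly like this (3600 seconds means the NodeManager uploads the logs about once per hour):

<property>
  <name>yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds</name>
  <value>3600</value>
</property>

After that you can fetch the aggregated logs with yarn logs -applicationId <your_application_id>, even while the application is still running.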

This is the first step. You can see a little about this here.

As a second step you need to create the files log4j-driver.properties and log4j-executor.properties.

In these files you can use this example:

log4j.rootLogger=INFO, rolling
log4j.appender.rolling=org.apache.log4j.RollingFileAppender
log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling.layout.conversionPattern=[%d] %p %m (%c)%n
log4j.appender.rolling.maxFileSize=50MB
log4j.appender.rolling.maxBackupIndex=5
log4j.appender.rolling.file=/var/log/spark/${dm.logging.name}.log
log4j.appender.rolling.encoding=UTF-8
log4j.logger.org.apache.spark=WARN
log4j.logger.org.eclipse.jetty=WARN
log4j.logger.com.anjuke.dm=${dm.logging.level}

What do these rows mean?

The line log4j.appender.rolling.maxFileSize=50MB caps each log file at 50MB. When a log file reaches 50MB it is closed and a new one is started.

The other relevant row is log4j.appender.rolling.maxBackupIndex=5: it keeps a backup history of 5 files of 50MB each. Over time the oldest backups are deleted as new files are created.
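Putting those two settings together, and assuming dm.logging.name is set to myapp as in the spark-submit command below, the log directory never grows beyond roughly six files of 50MB:

/var/log/spark/myapp.log      # current file, rolls over at 50MB
/var/log/spark/myapp.log.1    # most recent backup
...
/var/log/spark/myapp.log.5    # oldest backup, removed at the next roll-over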

After you create these log4j files you need to pass them to your spark-submit command:

spark-submit \
  --master spark://127.0.0.1:7077 \
  --driver-java-options "-Dlog4j.configuration=file:/path/to/log4j-driver.properties -Ddm.logging.level=DEBUG" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/path/to/log4j-executor.properties -Ddm.logging.name=myapp -Ddm.logging.level=DEBUG" \
  ...

You can create one log file for your driver and another for your executors. In the command above I'm using two different files, but you can use the same one for both. For more details you can see here.

answered Sep 17 '22 by Thiago Baldim