How to clean Spark history event log without stopping Spark streaming

We have a Spark Streaming application that runs as a long-lived task. Its event log is written to the HDFS location hdfs://spark-history; an application_XXX.inprogress file is created there when the streaming application starts, and the file grows to around 70GB. To delete the log file we currently stop the Spark streaming application and clear it. Is there any way to automate this cleanup without stopping or restarting the application? We have configured spark.history.fs.cleaner.enabled=true with a cleaning interval of 1 day and a max age of 2 days, but it does not clean up the .inprogress file. We are using Spark 1.6.2, running on YARN in cluster mode.
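For reference, the setup described above would look roughly like this in spark-defaults.conf (a sketch only; the exact log directory and values are assumptions based on the question, not from the original post):

# spark-defaults.conf -- sketch of the configuration described in the question
spark.eventLog.enabled              true
spark.eventLog.dir                  hdfs://spark-history
spark.history.fs.logDirectory       hdfs://spark-history
spark.history.fs.cleaner.enabled    true
spark.history.fs.cleaner.interval   1d
spark.history.fs.cleaner.maxAge     2d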

asked Mar 14 '17 by Vamshi Mothe


People also ask

How do I keep the Spark session alive?

Before stopping the SparkContext, call Thread.sleep(86400000). This keeps your Spark UI active for 24 hours, until you kill the process.
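As a minimal Scala sketch (assuming sc is your existing SparkContext):

// Keep the application, and therefore its UI, alive for 24 hours before stopping.
Thread.sleep(86400000L)  // 24 hours in milliseconds
sc.stop()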

How do I stop Spark History server?

You can stop a running instance of HistoryServer using the $SPARK_HOME/sbin/stop-history-server.sh shell script.
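For example (assuming $SPARK_HOME points at your Spark installation):

$SPARK_HOME/sbin/stop-history-server.sh     # stop the running History Server
$SPARK_HOME/sbin/start-history-server.sh    # start it again when needed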

How do I check my Spark history?

The location is controlled by spark.eventLog.dir; by default it is file:///tmp/spark-events, and you need to create the directory in advance. Spark keeps a history of every application you run by creating a sub-directory for each application and logging the events specific to that application in this directory.
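In other words, before the first run you create the directory, for example:

mkdir -p /tmp/spark-events           # default local location (file:///tmp/spark-events)
hdfs dfs -mkdir -p /spark-history    # or an HDFS location, as in the question above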


1 Answer

For this issue you have to change a few configurations. In your yarn-default.xml you need to change (or add) this row:

yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds=3600

This modification makes YARN roll up and aggregate your log files periodically, which allows you to see the data via yarn logs -applicationId YOUR_APP_ID.
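Since yarn-default.xml is an XML file, the same setting expressed in XML form would look roughly like this (3600 seconds means the NodeManager uploads the logs about once per hour):

<property>
  <name>yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds</name>
  <value>3600</value>
</property>

After that you can fetch the aggregated logs with yarn logs -applicationId <your_application_id>, even while the application is still running.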

This is the first step. You can see a little about this here.

As a second step you need to create the files log4j-driver.properties and log4j-executor.properties.

In these files you can use this example:

log4j.rootLogger=INFO, rolling
log4j.appender.rolling=org.apache.log4j.RollingFileAppender
log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling.layout.conversionPattern=[%d] %p %m (%c)%n
log4j.appender.rolling.maxFileSize=50MB
log4j.appender.rolling.maxBackupIndex=5
log4j.appender.rolling.file=/var/log/spark/${dm.logging.name}.log
log4j.appender.rolling.encoding=UTF-8
log4j.logger.org.apache.spark=WARN
log4j.logger.org.eclipse.jetty=WARN
log4j.logger.com.anjuke.dm=${dm.logging.level}

What do these rows mean?

The line log4j.appender.rolling.maxFileSize=50MB caps each log file at 50MB. When a log file reaches 50MB it is closed and a new one is started.

The other relevant row is log4j.appender.rolling.maxBackupIndex=5: it keeps a backup history of 5 files of 50MB each. Over time the oldest backups are deleted as new files are created.
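Putting those two settings together, and assuming dm.logging.name is set to myapp as in the spark-submit command below, the log directory never grows beyond roughly six files of 50MB:

/var/log/spark/myapp.log      # current file, rolls over at 50MB
/var/log/spark/myapp.log.1    # most recent backup
...
/var/log/spark/myapp.log.5    # oldest backup, removed at the next roll-over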

After you create these log4j files you need to pass them to your spark-submit command:

spark-submit \
  --master spark://127.0.0.1:7077 \
  --driver-java-options "-Dlog4j.configuration=file:/path/to/log4j-driver.properties -Ddm.logging.level=DEBUG" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/path/to/log4j-executor.properties -Ddm.logging.name=myapp -Ddm.logging.level=DEBUG" \
  ...

You can create one log file for your driver and another for your executors. In the command above I'm using two different files, but you can use the same one for both. For more details you can see here.

answered Sep 17 '22 by Thiago Baldim