We have a long-running EMR cluster to which we submit Spark jobs. Over time, HDFS fills up with Spark application logs, which sometimes causes a host to be marked unhealthy by EMR/YARN.
Running hadoop fs -ls -R -h /
produces the listing in [1], which clearly shows that no application logs have ever been deleted.
We have set spark.history.fs.cleaner.enabled to true (validated this in the Spark UI) and were hoping the defaults for the other settings, i.e. a cleaner interval of 1 day and a cleaner max age of 7 days as described at http://spark.apache.org/docs/latest/monitoring.html#spark-configuration-options, would take care of cleaning up these logs. But that is not the case.
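For what it's worth, the values the history server actually starts with can also be checked directly in the config file on the master node (a quick sanity check; the /etc/spark/conf path assumes a stock EMR install, so adjust for other layouts):
# path assumes a stock EMR install
grep 'spark.history.fs.cleaner' /etc/spark/conf/spark-defaults.conf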
Any ideas?
[1]
-rwxrwx--- 2 hadoop spark 543.1 M 2017-01-11 13:13 /var/log/spark/apps/application_1484079613665_0001
-rwxrwx--- 2 hadoop spark 7.8 G 2017-01-17 10:51 /var/log/spark/apps/application_1484079613665_0002.inprogress
-rwxrwx--- 2 hadoop spark 1.4 G 2017-01-18 08:11 /var/log/spark/apps/application_1484079613665_0003
-rwxrwx--- 2 hadoop spark 2.9 G 2017-01-20 07:41 /var/log/spark/apps/application_1484079613665_0004
-rwxrwx--- 2 hadoop spark 125.9 M 2017-01-20 09:57 /var/log/spark/apps/application_1484079613665_0005
-rwxrwx--- 2 hadoop spark 4.4 G 2017-01-23 10:19 /var/log/spark/apps/application_1484079613665_0006
-rwxrwx--- 2 hadoop spark 6.6 M 2017-01-23 10:31 /var/log/spark/apps/application_1484079613665_0007
-rwxrwx--- 2 hadoop spark 26.4 M 2017-01-23 11:09 /var/log/spark/apps/application_1484079613665_0008
-rwxrwx--- 2 hadoop spark 37.4 M 2017-01-23 11:53 /var/log/spark/apps/application_1484079613665_0009
-rwxrwx--- 2 hadoop spark 111.9 M 2017-01-23 13:57 /var/log/spark/apps/application_1484079613665_0010
-rwxrwx--- 2 hadoop spark 1.3 G 2017-01-24 10:26 /var/log/spark/apps/application_1484079613665_0011
-rwxrwx--- 2 hadoop spark 7.0 M 2017-01-24 10:37 /var/log/spark/apps/application_1484079613665_0012
-rwxrwx--- 2 hadoop spark 50.7 M 2017-01-24 11:40 /var/log/spark/apps/application_1484079613665_0013
-rwxrwx--- 2 hadoop spark 96.2 M 2017-01-24 13:27 /var/log/spark/apps/application_1484079613665_0014
-rwxrwx--- 2 hadoop spark 293.7 M 2017-01-24 17:58 /var/log/spark/apps/application_1484079613665_0015
-rwxrwx--- 2 hadoop spark 7.6 G 2017-01-30 07:01 /var/log/spark/apps/application_1484079613665_0016
-rwxrwx--- 2 hadoop spark 1.3 G 2017-01-31 02:59 /var/log/spark/apps/application_1484079613665_0017
-rwxrwx--- 2 hadoop spark 2.1 G 2017-02-01 12:04 /var/log/spark/apps/application_1484079613665_0018
-rwxrwx--- 2 hadoop spark 2.8 G 2017-02-03 08:32 /var/log/spark/apps/application_1484079613665_0019
-rwxrwx--- 2 hadoop spark 5.4 G 2017-02-07 02:03 /var/log/spark/apps/application_1484079613665_0020
-rwxrwx--- 2 hadoop spark 9.3 G 2017-02-13 03:58 /var/log/spark/apps/application_1484079613665_0021
-rwxrwx--- 2 hadoop spark 2.0 G 2017-02-14 11:13 /var/log/spark/apps/application_1484079613665_0022
-rwxrwx--- 2 hadoop spark 1.1 G 2017-02-15 03:49 /var/log/spark/apps/application_1484079613665_0023
-rwxrwx--- 2 hadoop spark 8.8 G 2017-02-21 05:42 /var/log/spark/apps/application_1484079613665_0024
-rwxrwx--- 2 hadoop spark 371.2 M 2017-02-21 11:54 /var/log/spark/apps/application_1484079613665_0025
-rwxrwx--- 2 hadoop spark 1.4 G 2017-02-22 09:17 /var/log/spark/apps/application_1484079613665_0026
-rwxrwx--- 2 hadoop spark 3.2 G 2017-02-24 12:36 /var/log/spark/apps/application_1484079613665_0027
-rwxrwx--- 2 hadoop spark 9.5 M 2017-02-24 12:48 /var/log/spark/apps/application_1484079613665_0028
-rwxrwx--- 2 hadoop spark 20.5 G 2017-03-10 04:00 /var/log/spark/apps/application_1484079613665_0029
-rwxrwx--- 2 hadoop spark 7.3 G 2017-03-10 04:04 /var/log/spark/apps/application_1484079613665_0030.inprogress
Standalone mode: Spark executor logs are located in the $SPARK_HOME/work/app-<AppName> directory (where <AppName> is the name of your application), which also contains the executors' stdout/stderr. YARN mode: executor logs are available via the yarn logs -applicationId <appId> command.
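As a quick usage example, pulling the aggregated YARN logs for the first application in the listing above (application ID taken from [1]):
yarn logs -applicationId application_1484079613665_0001 | less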
I was running into this issue on emr-5.4.0, and setting spark.history.fs.cleaner.interval to 1h got the cleaner to run. For reference, here is the end of my spark-defaults.conf file:
spark.history.fs.cleaner.enabled true
spark.history.fs.cleaner.maxAge 12h
spark.history.fs.cleaner.interval 1h
After you make the change, restart your Spark history server.
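On an EMR 5.x master node that typically looks like the following (the upstart service name is an assumption based on stock EMR images; newer, systemd-based releases would use sudo systemctl restart spark-history-server instead):
# EMR 5.x (upstart); service name assumed from stock EMR images
sudo stop spark-history-server
sudo start spark-history-server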
Another clarification: setting these values at submit time, i.e. passing them to spark-submit via --conf, has no effect. Either set them at cluster creation time via the EMR configuration API, or manually edit spark-defaults.conf, set the values, and restart the Spark history server. Also note that logs are only cleaned up the next time the Spark application restarts. For instance, a long-running Spark streaming job will not have any logs for its current run deleted and will keep accumulating them; only when the job restarts (perhaps because of a deployment) can the older logs be cleaned up.
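For the cluster-creation route, here is a minimal sketch of the EMR configurations JSON (the spark-defaults classification is EMR's documented way to populate spark-defaults.conf; the values shown are simply the ones from the answer above):
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.history.fs.cleaner.enabled": "true",
      "spark.history.fs.cleaner.maxAge": "12h",
      "spark.history.fs.cleaner.interval": "1h"
    }
  }
]
You can pass this with, for example, aws emr create-cluster ... --configurations file://spark-config.json.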