TL;DR is it possible to suppress individual Spark logging messages without clobbering all logging?
I'm running a Spark Streaming job on EMR, and getting logging messages like:
17/08/17 21:09:00 INFO TaskSetManager: Finished task 101.0 in stage 5259.0 (TID 315581) in 17 ms on ip-172-31-37-216.ec2.internal (107/120)
17/08/17 21:09:00 INFO MapPartitionsRDD: Removing RDD 31559 from persistence list
17/08/17 21:09:00 INFO DAGScheduler: Job 2629 finished: foreachPartition at StreamingSparkJob.scala:52, took 0.080085 s
17/08/17 21:09:00 INFO DAGScheduler: ResultStage 5259 (foreachPartition at StreamingSparkJob.scala:52) finished in 0.077 s
17/08/17 21:09:00 INFO JobScheduler: Total delay: 0.178 s for time 1503004140000 ms (execution: 0.084 s)
None of this is helpful at this stage of development, and it drowns out the real logging that my application emits deliberately. I would like to stop Spark from emitting these log messages, or at least suppress their recording.
AWS Customer Support, and various answers I've found, suggest that this can be achieved by passing some JSON configuration at cluster creation. However, since this is a streaming job (for which the cluster would, ideally, stay up forever and just get redeployed-to), I'd like to find some way to configure this via `spark-submit` options.
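For reference, a sketch of what such a `spark-submit` invocation might look like, shipping a custom `log4j.properties` alongside the job and pointing both driver and executor JVMs at it (the class and jar names below are placeholders, not taken from the question):

```shell
# Ship a custom log4j.properties with the job and tell both the driver
# and executor JVMs to load it (log4j 1.x system property).
# The --class value and jar name are placeholders.
spark-submit \
  --files log4j.properties \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties" \
  --class com.example.StreamingSparkJob \
  streaming-spark-job.jar
```

Because the file travels with each submission, redeploying the streaming job picks up logging changes without recreating the cluster.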
Other answers suggest that this can be done by submitting a `log4j.properties` file that sets `log4j.rootCategory=WARN, <appender>`. However, `rootCategory` appears to be a synonym for `rootLogger`, so I would interpret this as limiting all logging (not just Spark's) to `WARN`. Indeed, when I deployed a change doing this, that is what I observed.
I note that the Spark documentation says: "Spark uses log4j for logging. You can configure it by adding a `log4j.properties` file in the `conf` directory. One way to start is to copy the existing `log4j.properties.template` located there." I'm about to experiment with this to see whether it will suppress the `INFO` logs that fill up our logging. However, this is still not an ideal solution, because some of the `INFO` logs Spark emits are useful - for instance, when it records the number of files picked up (from S3) by each streaming iteration. So, what I'd ideally like would be one of:

- a way to suppress specific Spark loggers, leaving the rest of Spark's `INFO` output intact, or
- a way to raise Spark's logging threshold to `WARN` without also raising my own application's.

Do either of these exist?
(To address a possible response - I'm loath to emit logs from my own application only at `WARN` and above.)
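As background, log levels can also be adjusted programmatically with the log4j 1.x API that Spark bundles. A sketch, where the logger names mirror the class names visible in the log lines above and are assumptions on my part:

```scala
import org.apache.log4j.{Level, Logger}

object LogTuning {
  /** Silence only the chattiest Spark internals, leaving other
    * Spark INFO output (e.g. file-pickup reports) intact.
    * Logger names are guessed from the classes in the sample logs. */
  def quietSpark(): Unit = {
    Logger.getLogger("org.apache.spark.scheduler.TaskSetManager").setLevel(Level.WARN)
    Logger.getLogger("org.apache.spark.scheduler.DAGScheduler").setLevel(Level.WARN)
    Logger.getLogger("org.apache.spark.rdd.MapPartitionsRDD").setLevel(Level.WARN)
    Logger.getLogger("org.apache.spark.streaming.scheduler.JobScheduler").setLevel(Level.WARN)
  }
}
```

Note that code in the driver only affects the driver JVM; executors would still need the same configuration delivered via a properties file.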
You can control logs per logger namespace from `log4j.properties`; here is an example:
log4j.rootLogger=WARN, console
# add a ConsoleAppender named "console" that writes to stdout
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.out
log4j.appender.console.layout=org.apache.log4j.PatternLayout
# use a simple message format
log4j.appender.console.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n
# set the log level for these components
log4j.logger.org.apache.spark=WARN
log4j.logger.org.spark-project=ERROR
log4j.logger.org.apache.hadoop=ERROR
log4j.logger.io.netty=ERROR
log4j.logger.org.apache.zookeeper=ERROR
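The same mechanism works at a finer granularity than `org.apache.spark`, which addresses the question's wish to keep some Spark `INFO` output. A sketch; the logger names below mirror Spark class names and are assumptions:

```properties
# root stays at INFO so the application's own logs still appear
log4j.rootLogger=INFO, console

# silence only the noisy components seen in the question's log lines
log4j.logger.org.apache.spark.scheduler.TaskSetManager=WARN
log4j.logger.org.apache.spark.scheduler.DAGScheduler=WARN
log4j.logger.org.apache.spark.streaming.scheduler.JobScheduler=WARN

# keep the streaming file source's reports of files picked up
log4j.logger.org.apache.spark.streaming.dstream.FileInputDStream=INFO
```

In log4j 1.x, the most specific matching logger name wins, so these per-class settings override any broader `org.apache.spark` level.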