Calling spark-submit causes the default Ivy logs for fetched packages to be displayed. That output is relevant on a first launch, but once the packages are cached the repeated cache-hit logging is not very useful.
What is the best way to disable these logs?
I don't want to see things like:
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
:: loading settings :: url = jar:file:/usr/local/spark-2.0.2-bin-hadoop2.4/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.amazonaws#aws-java-sdk added as a dependency
org.apache.hadoop#hadoop-aws added as a dependency
...
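For context, this is roughly the kind of session setup that triggers that Ivy resolution report; the same applies to spark-submit --packages, and the package coordinate below is illustrative, not the exact one I used:
from pyspark.sql import SparkSession

# Any session that pulls dependencies via spark.jars.packages (or --packages)
# makes Ivy resolve them and print the report shown above.
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.3")
    .getOrCreate()
)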
Solution
On Spark 3.0+ the original answer doesn't work. I have spent an unreasonable amount of time trying to hide the Ivy startup messages, and this is the only thing that worked:
import os
from subprocess import Popen
from unittest.mock import patch
from pyspark.sql import SparkSession

# Redirect the JVM gateway subprocess's stdout and stderr to /dev/null
with patch(
    "pyspark.java_gateway.Popen",
    side_effect=lambda *args, **kwargs: Popen(*args, **kwargs, stdout=open(os.devnull, "wb"), stderr=open(os.devnull, "wb")),
):
    spark: SparkSession = SparkSession.builder.getOrCreate()
This is a very blunt, brittle instrument - it redirects everything the JVM subprocess writes to stdout and stderr during Spark startup. Ivy writes most of its report output to stderr, so you can't get away with suppressing only stdout.
Background on Failed Attempts
Newer versions of Spark use log4j2. I was not able to make any perceptible impact by restricting the root logger, so I dug into the Ivy source code to find out which logging engine it uses. It turns out that, when you drill down far enough, it uses System.out.println() and a custom MessageLogger class - there are no references to log4j that I could find.
Looking at the Spark docs, there is a way to override the ivysettings.xml file, which contains references to the "report" writer responsible for the startup messages. However, doing so effectively breaks Ivy unless you know exactly what to put in there, and there was little information on how to change the report output anyway.
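For reference, a minimal sketch of how such an override would be wired in; the path is a placeholder, and you would still need a complete, working ivysettings.xml for resolution to keep functioning:
from pyspark.sql import SparkSession

# spark.jars.ivySettings points Spark's Ivy resolver at a custom settings file.
# /path/to/ivysettings.xml is a placeholder; an incomplete file breaks resolution entirely.
spark = (
    SparkSession.builder
    .config("spark.jars.ivySettings", "/path/to/ivysettings.xml")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.3")
    .getOrCreate()
)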
Moving on, I next tried to suppress python output of stderr and stdout. This had no impact - the assumption I made was because the spark process runs in a jvm subprocess, outside of the python flow. Thus, patching the call directly was the only way to go.
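For completeness, this is roughly the Python-side suppression that did not help; the JVM subprocess inherits the process's actual file descriptors, so swapping Python's stream objects leaves Ivy's output untouched:
import io
from contextlib import redirect_stdout, redirect_stderr
from pyspark.sql import SparkSession

# Redirecting Python-level streams has no effect on what the JVM writes.
with redirect_stdout(io.StringIO()), redirect_stderr(io.StringIO()):
    spark = SparkSession.builder.getOrCreate()  # Ivy messages still appear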