In (Py)Spark 2.4, it is possible to redact some sensitive information from the event logs, for example:
import os
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .config("spark.eventLog.enabled", "true") \
    .config("spark.eventLog.dir", "hdfs:///tmp/spark-events") \
    .config("spark.redaction.regex", os.environ["SPARK_REDACTION_REGEX"]).getOrCreate()
This is supposed to redact all matching information from the Spark event logs, and it does, at least in the SparkListenerEnvironmentUpdate event.
However, when checking the event file, some sensitive data matching the regex is still not redacted, for example in the SparkListenerJobStart event.
How can I redact ALL the matching information, for ALL events?
It does not seem to be possible in (Py)Spark 2. However, this is fixed in Spark 3.1, where all variables matching the redaction regex are correctly redacted.
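To check which events in an event log still carry unredacted values (for instance before and after moving to Spark 3.1), something like the following sketch may help. It assumes the event log is uncompressed and contains one JSON object per line with an "Event" field, and that redacted values appear as the placeholder `*********(redacted)`; adjust that constant if your version writes something different. The function names are hypothetical, not part of any Spark API.

```python
import json
import re
from collections import Counter

# Assumption: the placeholder Spark writes in place of a redacted value.
REDACTED = "*********(redacted)"


def iter_string_pairs(node):
    """Yield (key, value) for every string-valued entry nested in a JSON value."""
    if isinstance(node, dict):
        for key, value in node.items():
            if isinstance(value, str):
                yield key, value
            else:
                yield from iter_string_pairs(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_string_pairs(item)


def count_unredacted_by_event(lines, pattern):
    """Count, per event type, the entries whose key or value matches
    `pattern` but whose value was not replaced by the placeholder."""
    regex = re.compile(pattern)
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        hits = sum(
            1
            for key, value in iter_string_pairs(event)
            if value != REDACTED and (regex.search(key) or regex.search(value))
        )
        if hits:
            counts[event.get("Event", "<unknown>")] += hits
    return counts
```

To use it, open the event file (copied locally from HDFS, for example) and pass the file object together with the same regex you set in `spark.redaction.regex`. Any event type that shows up in the result still contains matching values in clear text.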