
In Spark (2.4 and above), how to completely redact ALL sensitive information from the event logs?

In (Py)Spark 2.4, it is possible to redact some sensitive information from the event logs, for example:

    import os
    from pyspark.sql import SparkSession
    spark = SparkSession.builder \
        .config("spark.eventLog.enabled", "true") \
        .config("spark.eventLog.dir", "hdfs:///tmp/spark-events") \
        .config("spark.redaction.regex", os.environ["SPARK_REDACTION_REGEX"]).getOrCreate()

This should redact every property that matches the regex in the Spark event logs, and it does so at least in the SparkListenerEnvironmentUpdate event.

However, when I check the event log file, some sensitive data matching the regex is still not redacted.

For example, in the SparkListenerJobStart event.

How can I redact ALL the matching information, for ALL the events?

Itération 122442 asked Sep 02 '25 02:09


1 Answer

It does not seem to be possible in (Py)Spark 2. However, this is fixed in Spark 3.1, where all variables matching the redaction regex are correctly redacted.
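To check this on your own event logs, before or after upgrading, a small script along these lines may help. It is only a sketch under a few assumptions: the event log is uncompressed and holds one JSON object per line, each object carries its event name under an "Event" key, and Spark replaces redacted values with the placeholder `*********(redacted)`. The function name `find_unredacted` is hypothetical, not part of any Spark API.

```python
import json
import re

# Placeholder assumed to be what Spark writes in place of a redacted value.
REDACTED = "*********(redacted)"


def find_unredacted(lines, pattern):
    """Yield (event name, key) for string values whose key or value matches
    the regex but whose value is not the redaction placeholder."""
    regex = re.compile(pattern)

    def walk(node, event):
        # Recurse through nested JSON objects and arrays.
        if isinstance(node, dict):
            for key, value in node.items():
                if isinstance(value, str):
                    hit = regex.search(key) or regex.search(value)
                    if hit and value != REDACTED:
                        yield event, key
                else:
                    yield from walk(value, event)
        elif isinstance(node, list):
            for item in node:
                yield from walk(item, event)

    for line in lines:
        line = line.strip()
        if line:
            record = json.loads(line)
            yield from walk(record, record.get("Event", "?"))
```

For a local copy of an event log, something like `with open(path) as f: print(sorted(set(find_unredacted(f, regex))))` lists the events and keys that still leak a matching value. Here `path` and `regex` are placeholders for your own file and for the same pattern you pass to `spark.redaction.regex`.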

Itération 122442 answered Sep 09 '25 09:09