We have a use case where we need to export data from HDFS to an RDBMS. I saw this example . There they have stored the username and password in the code. Is there any way to hide the password while exporting the data, like the password-alias option we have in Sqoop?
The DataFrame API and the Dataset API are the ways to interact with Spark SQL.
Apache Spark has multiple ways to read data from different sources such as files, databases, etc. But when it comes to loading data into an RDBMS (relational database management system), Spark supports only Append and Overwrite of the data using DataFrames; Spark DataFrames do not support updating existing rows in a database (see the sketch below).
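To make the supported modes concrete, here is a minimal sketch, assuming df is an existing DataFrame and url/connectionProperties hold the JDBC settings described below:

import org.apache.spark.sql.SaveMode

// Only Append and Overwrite apply when writing to an RDBMS over JDBC;
// there is no save mode that UPDATEs existing rows in place.
df.write.mode(SaveMode.Append).jdbc(url, "target_table", connectionProperties)    // add rows
df.write.mode(SaveMode.Overwrite).jdbc(url, "target_table", connectionProperties) // replace the table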
Spark SQL also includes a data source that can read data from other databases using JDBC. This functionality should be preferred over JdbcRDD, because the results are returned as a DataFrame and can easily be processed in Spark SQL or joined with other data sources.
To connect to any database we basically require the common properties: the database driver, the DB URL, a username, and a password. Hence connecting from PySpark code also requires the same set of properties. url is the JDBC URL used to connect to the database (see the sketch below).
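As an illustration (the host, database, and table names are placeholders), the same set of properties expressed through the DataFrameReader option API, in Scala to match the answer code further down:

val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://mysql-host:3306/testdb")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", "ml_models")
  .option("user", "test_user")
  .option("password", "test_pass")
  .load()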
Passing passwords and secrets as --conf doesn't feel right for a couple of reasons:
- at best the value is obfuscated (e.g. base64-encoded), not encrypted
- anyone who can see it could use the string to invoke the same endpoint with malicious intent
A few approaches to be more secure:
- In AWS Secrets Manager, SSM Parameter Store, or Vault, we could store the credentials. In Spark, we could use client libraries like boto3 to fetch the password at run time instead of relying on --conf based secrets (a Scala sketch follows this list).
- Use an env variable to determine whether the job runs in a local development environment or on the cloud, and take the necessary action.
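As a hedged illustration of the secrets-store approach on the JVM (the boto3 library mentioned above is Python; here the AWS SDK for Java v2 plays the same role, and prod/jdbc/password is a made-up secret name):

import software.amazon.awssdk.services.secretsmanager.SecretsManagerClient
import software.amazon.awssdk.services.secretsmanager.model.GetSecretValueRequest

// Fetch the password at run time instead of passing it on the command line.
val secretsClient = SecretsManagerClient.create()
val request = GetSecretValueRequest.builder().secretId("prod/jdbc/password").build() // hypothetical secret id
val jdbcPassword = secretsClient.getSecretValue(request).secretString()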
Setting the password
At the command line, as a plaintext Spark config:
spark-submit --conf spark.jdbc.password=test_pass ...
Using an environment variable:
export jdbc_password=test_pass_export
spark-submit --conf spark.jdbc.password=$jdbc_password ...
Using a Spark config properties file:
echo "spark.jdbc.password=test_pass_prop" > credentials.properties
spark-submit --properties-file credentials.properties
With base64 encoding to "obfuscate":
echo "spark.jdbc.b64password=$(echo -n test_pass_prop | base64)" > credentials_b64.properties
spark-submit --properties-file credentials_b64.properties
Using the password in code
import java.util.Base64                   // for decoding the base64 value
import java.nio.charset.StandardCharsets  // charset for the decoded bytes

val properties = new java.util.Properties()
properties.put("driver", "com.mysql.jdbc.Driver")
properties.put("url", "jdbc:mysql://mysql-host:3306")
properties.put("user", "test_user")

// Read the obfuscated config set via --conf or --properties-file and decode it.
// For the plaintext variants above, spark.conf.get("spark.jdbc.password") suffices.
val password = new String(Base64.getDecoder().decode(spark.conf.get("spark.jdbc.b64password")), StandardCharsets.UTF_8)
properties.put("password", password)

val models = spark.read.jdbc(properties.get("url").toString, "ml_models", properties)
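Since the original question is about exporting data, here is a minimal sketch of the write side using the same decoded credentials (ml_models_export is a hypothetical target table; models stands in for whichever DataFrame you built from the HDFS data):

// Export the DataFrame to the RDBMS; "append" adds rows, "overwrite" replaces the table.
models.write
  .mode("append")
  .jdbc(properties.get("url").toString, "ml_models_export", properties)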
Edit: Spark command-line interface help docs for --conf and --properties-file:
--conf PROP=VALUE Arbitrary Spark configuration property.
--properties-file FILE Path to a file from which to load extra properties. If not
specified, this will look for conf/spark-defaults.conf.
The properties-file name is arbitrary.