How to protect password and username in Spark (such as for JDBC connections/accessing RDBMS databases)?

We have a use case where we need to export data from HDFS to an RDBMS. I saw this example. There they store the username and password in the code. Is there any way to hide the password while exporting the data, like the password-alias option in Sqoop?

asked Apr 11 '17 by Rajat Mishra

People also ask

What methods are used by Spark SQL to connect to databases?

The DataFrame API and the Dataset API are the main ways to interact with Spark SQL.

Can Spark connect to RDBMS?

Apache Spark has multiple ways to read data from different sources like files, databases, etc. But when it comes to loading data into an RDBMS (relational database management system), Spark supports only Append and Overwrite of the data using DataFrames. Spark DataFrames do not support updating rows in a database.

How does JDBC work in Spark?

Spark SQL also includes a data source that can read data from other databases using JDBC. This functionality should be preferred over using JdbcRDD. This is because the results are returned as a DataFrame and they can easily be processed in Spark SQL or joined with other data sources.
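
For instance, a minimal read through the JDBC data source might look like the sketch below; the URL, table name, and credentials are placeholders, not values from the question.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-read").getOrCreate()

// Hypothetical connection details; replace with your own.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://mysql-host:3306/mydb")
  .option("dbtable", "my_table")
  .option("user", "test_user")
  .option("password", "test_pass")
  .load()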

What connectors are used for database connections in PySpark?

To connect to any database we basically require the common properties: the database driver, the DB URL, a username, and a password. Connecting from PySpark code requires the same set of properties. url is the JDBC URL used to connect to the database.


2 Answers

Passing passwords and secrets as --conf doesn't feel right, for a couple of reasons:

  • They might be visible in logs
  • They are prone to man-in-the-middle attacks
  • Even if the password is obfuscated, the man in the middle can use the obfuscated string to invoke the same endpoint with malicious intent

A few approaches that are more secure:

  • Use the deployment environment's credentials storage and have Spark pull the credentials from it at runtime.
  • For example, if Spark is deployed on AWS, we could store the credentials in AWS Secrets Manager, SSM Parameter Store, or Vault. In Spark, we could use a client library like boto3 to fetch the password at run time (see the sketch after this list).
  • During development, this step can be avoided: in the Spark logic, check for the password in conf, and if it's not there, check the cloud provider's secret manager. That way --conf-based secrets are used only on a local box.
  • Another option is to use an environment variable to determine whether the job is running in a local development environment or on the cloud, and take the appropriate action.
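
A minimal Scala sketch of that fallback logic, assuming the AWS SDK v2 for Java is on the classpath and using a hypothetical secret id my-app/jdbc-password:

import org.apache.spark.sql.SparkSession
import software.amazon.awssdk.services.secretsmanager.SecretsManagerClient
import software.amazon.awssdk.services.secretsmanager.model.GetSecretValueRequest

def resolveJdbcPassword(spark: SparkSession): String =
  // Local development: use a --conf supplied password if one is present.
  spark.conf.getOption("spark.jdbc.password").getOrElse {
    // Deployed: pull the secret from AWS Secrets Manager at runtime.
    val client = SecretsManagerClient.create()
    try {
      val request = GetSecretValueRequest.builder()
        .secretId("my-app/jdbc-password") // hypothetical secret id
        .build()
      client.getSecretValue(request).secretString()
    } finally client.close()
  }

This keeps the real secret out of spark-submit arguments and driver logs; access to it is governed by IAM instead.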
answered Oct 04 '22 by Sairam Krish


Setting the password

At the command line as a plaintext Spark config:

spark-submit --conf spark.jdbc.password=test_pass ... 

Using environment variable:

export jdbc_password=test_pass_export
spark-submit --conf spark.jdbc.password=$jdbc_password ...

Using spark config properties file:

echo "spark.jdbc.b64password=test_pass_prop" > credentials.properties
spark-submit --properties-file credentials.properties

With base64 encoding to "obfuscate":

echo "spark.jdbc.b64password=$(echo -n test_pass_prop | base64)" > credentials_b64.properties
spark-submit --properties-file credentials_b64.properties
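
To sanity-check the round trip before submitting (GNU coreutils base64 assumed):

echo -n test_pass_prop | base64          # prints dGVzdF9wYXNzX3Byb3A=
echo dGVzdF9wYXNzX3Byb3A= | base64 -d    # prints test_pass_prop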

Using the password in code

import java.util.Base64 // for base64 decoding
import java.nio.charset.StandardCharsets // for base64 decoding

val properties = new java.util.Properties()
properties.put("driver", "com.mysql.jdbc.Driver")
properties.put("url", "jdbc:mysql://mysql-host:3306")
properties.put("user", "test_user")
// Decode the obfuscated password passed in via --conf or --properties-file
val password = new String(
  Base64.getDecoder().decode(spark.conf.get("spark.jdbc.b64password")),
  StandardCharsets.UTF_8)
properties.put("password", password)
val models = spark.read.jdbc(properties.get("url").toString, "ml_models", properties)

Edit: Spark command-line interface help docs for --conf and --properties-file:

  --conf PROP=VALUE           Arbitrary Spark configuration property.
  --properties-file FILE      Path to a file from which to load extra properties. If not
                              specified, this will look for conf/spark-defaults.conf.

The properties-file name is arbitrary.

answered Oct 04 '22 by Garren S