We have a use case where we need to export data from HDFS to a RDBMS. I saw this example . Here they have store the username and password in the code. Is there any way to hide the password while export the data like we have the option of password-alias in Sqoop.
DataFrame API and Datasets API are the ways to interact with Spark SQL.
Apache Spark has multiple ways to read data from different sources like files, databases etc. But when it comes to loading data into RDBMS(relational database management system), Spark supports only Append and Overlay of the data using dataframes. Spark dataframes do not support Updating of data into a database.
Spark SQL also includes a data source that can read data from other databases using JDBC. This functionality should be preferred over using JdbcRDD. This is because the results are returned as a DataFrame and they can easily be processed in Spark SQL or joined with other data sources.
To connect any database connection we require basically the common properties such as database driver , db url , username and password. Hence in order to connect using pyspark code also requires the same set of properties. url — the JDBC url to connect the database.
By passing passwords and secrets as --conf doesn't feel right for couple of reasons:
obfuscated string to invoke same endpoint with malicious intendFew approaches to be more secure
AWS Secrets manager or SSM parameter store or Vault, we could store the credentials. In spark, we could use client libraries like boto3 to fetch the password at run time.--conf based secrets.env variable to determine if it's run in local development environment or on cloud and take necessary action.Setting the password
At the command line as a plaintext spark config:
spark-submit --conf spark.jdbc.password=test_pass ... 
Using environment variable:
export jdbc_password=test_pass_export
spark-submit --conf spark.jdbc.password=$jdbc_password ...
Using spark config properties file:
echo "spark.jdbc.b64password=test_pass_prop" > credentials.properties
spark-submit --properties-file credentials.properties
With base64 encoding to "obfuscate":
echo "spark.jdbc.b64password=$(echo -n test_pass_prop | base64)" > credentials_b64.properties
spark-submit --properties-file credentials_b64.properties
Using the password in code
import java.util.Base64 // for base64
import java.nio.charset.StandardCharsets // for base64
val properties = new java.util.Properties()
properties.put("driver", "com.mysql.jdbc.Driver")
properties.put("url", "jdbc:mysql://mysql-host:3306")
properties.put("user", "test_user")
val password = new String(Base64.getDecoder().decode(spark.conf.get("spark.jdbc.b64password")), StandardCharsets.UTF_8)
properties.put("password", password)
val models = spark.read.jdbc(properties.get("url").toString, "ml_models", properties)
Edit: spark command line interface help docs for --conf and --properties-file:
  --conf PROP=VALUE           Arbitrary Spark configuration property.
  --properties-file FILE      Path to a file from which to load extra properties. If not
                              specified, this will look for conf/spark-defaults.conf.
The properties-file name is arbitrary.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With