We have a use case where we need to export data from HDFS to an RDBMS. I saw this example . There they have stored the username and password in the code. Is there any way to hide the password while exporting the data, like the password-alias option we have in Sqoop?
The DataFrame API and the Dataset API are the ways to interact with Spark SQL.
Apache Spark has multiple ways to read data from different sources such as files, databases, etc. But when it comes to loading data into an RDBMS (relational database management system), Spark supports only Append and Overwrite of the data using DataFrames; Spark DataFrames do not support updating existing rows in a database (see the sketch below).
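To make the supported modes concrete, here is a minimal sketch, assuming df is an existing DataFrame and url/connectionProperties hold the JDBC settings described below:

import org.apache.spark.sql.SaveMode

// Only Append and Overwrite apply when writing to an RDBMS over JDBC;
// there is no save mode that UPDATEs existing rows in place.
df.write.mode(SaveMode.Append).jdbc(url, "target_table", connectionProperties)    // add rows
df.write.mode(SaveMode.Overwrite).jdbc(url, "target_table", connectionProperties) // replace the table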
Spark SQL also includes a data source that can read data from other databases using JDBC. This functionality should be preferred over JdbcRDD, because the results are returned as a DataFrame and can easily be processed in Spark SQL or joined with other data sources.
To connect to any database we basically require the common properties: the database driver, the DB URL, a username, and a password. Hence connecting from PySpark code also requires the same set of properties. url is the JDBC URL used to connect to the database (see the sketch below).
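As an illustration (the host, database, and table names are placeholders), the same set of properties expressed through the DataFrameReader option API, in Scala to match the answer code further down:

val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://mysql-host:3306/testdb")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", "ml_models")
  .option("user", "test_user")
  .option("password", "test_pass")
  .load()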
Passing passwords and secrets as --conf doesn't feel right for a couple of reasons:
- at best the value is obfuscated (e.g. base64-encoded), not encrypted
- anyone who can see it could use the string to invoke the same endpoint with malicious intent
A few approaches to be more secure:
- In AWS Secrets Manager, SSM Parameter Store, or Vault, we could store the credentials. In Spark, we could use client libraries like boto3 to fetch the password at run time instead of relying on --conf based secrets (a Scala sketch follows this list).
- Use an env variable to determine whether the job runs in a local development environment or on the cloud, and take the necessary action.
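As a hedged illustration of the secrets-store approach on the JVM (the boto3 library mentioned above is Python; here the AWS SDK for Java v2 plays the same role, and prod/jdbc/password is a made-up secret name):

import software.amazon.awssdk.services.secretsmanager.SecretsManagerClient
import software.amazon.awssdk.services.secretsmanager.model.GetSecretValueRequest

// Fetch the password at run time instead of passing it on the command line.
val secretsClient = SecretsManagerClient.create()
val request = GetSecretValueRequest.builder().secretId("prod/jdbc/password").build() // hypothetical secret id
val jdbcPassword = secretsClient.getSecretValue(request).secretString()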
Setting the password
At the command line, as a plaintext Spark config:
spark-submit --conf spark.jdbc.password=test_pass ...
Using an environment variable:
export jdbc_password=test_pass_export
spark-submit --conf spark.jdbc.password=$jdbc_password ...
Using a Spark config properties file:
echo "spark.jdbc.password=test_pass_prop" > credentials.properties
spark-submit --properties-file credentials.properties
With base64 encoding to "obfuscate":
echo "spark.jdbc.b64password=$(echo -n test_pass_prop | base64)" > credentials_b64.properties
spark-submit --properties-file credentials_b64.properties
Using the password in code
import java.util.Base64                   // for decoding the base64 value
import java.nio.charset.StandardCharsets  // charset for the decoded bytes

val properties = new java.util.Properties()
properties.put("driver", "com.mysql.jdbc.Driver")
properties.put("url", "jdbc:mysql://mysql-host:3306")
properties.put("user", "test_user")

// Read the obfuscated config set via --conf or --properties-file and decode it.
// For the plaintext variants above, spark.conf.get("spark.jdbc.password") suffices.
val password = new String(Base64.getDecoder().decode(spark.conf.get("spark.jdbc.b64password")), StandardCharsets.UTF_8)
properties.put("password", password)

val models = spark.read.jdbc(properties.get("url").toString, "ml_models", properties)
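Since the original question is about exporting data, here is a minimal sketch of the write side using the same decoded credentials (ml_models_export is a hypothetical target table; models stands in for whichever DataFrame you built from the HDFS data):

// Export the DataFrame to the RDBMS; "append" adds rows, "overwrite" replaces the table.
models.write
  .mode("append")
  .jdbc(properties.get("url").toString, "ml_models_export", properties)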
Edit: Spark command-line interface help docs for --conf and --properties-file:
--conf PROP=VALUE Arbitrary Spark configuration property.
--properties-file FILE Path to a file from which to load extra properties. If not
specified, this will look for conf/spark-defaults.conf.
The properties-file name is arbitrary.