
Pyspark - how to initialize common DataFrameReader options separately?

I'm reading data with the same options multiple times. Is there a way to avoid duplicating common DataFrameReader options and somehow initialize them separately to use them on each read later?

    metrics_df = spark.read.format("jdbc") \
        .option("driver", self.driver) \
        .option("url", self.url) \
        .option("user", self.username) \
        .option("password", self.password) \
        .load()
asked Oct 15 '25 by samba
1 Answer

Define all your common options once on a DataFrameReader (i.e. an object of type `<class 'pyspark.sql.readwriter.DataFrameReader'>`), then add the `dbtable` option and call `load()` each time you want to reuse it.

Example:

# build a reader with the shared JDBC options, but don't call .load() yet
metrics_df_options = spark.read.format("jdbc") \
    .option("driver", self.driver) \
    .option("url", self.url) \
    .option("user", self.username) \
    .option("password", self.password)

type(metrics_df_options)
# <class 'pyspark.sql.readwriter.DataFrameReader'>

# configure dbtable and pull data from the RDBMS table
metrics_df_options.option("dbtable", "<table_name>").load().show()
answered Oct 18 '25 by notNull


