The goal of this question is to document:
steps required to read and write data using JDBC connections in PySpark
possible issues with JDBC sources and known solutions
With small changes these methods should work with other supported languages including Scala and R.
Spark's partitions dictate the number of connections used to push data through the JDBC API. You can control the parallelism by calling coalesce(<N>) or repartition(<N>) depending on the existing number of partitions: use coalesce to reduce the partition count and repartition to increase it.
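For instance, a minimal sketch (the URL, table name and credentials below are placeholders, not values from this answer) that caps a write at eight concurrent JDBC connections:
# Hypothetical example: limit the write to 8 concurrent JDBC connections
# by reducing the partition count before handing data to the writer.
(df
 .coalesce(8)
 .write
 .jdbc(url="jdbc:postgresql://localhost/foobar",
       table="baz",
       mode="append",
       properties={"user": "foo", "password": "bar"}))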
As the Spark documentation (JDBC To Other Databases) puts it: Spark SQL also includes a data source that can read data from other databases using JDBC. This functionality should be preferred over using JdbcRDD, because the results are returned as a DataFrame and can easily be processed in Spark SQL or joined with other data sources.
Additionally, you should use the SparkSession rather than the SQLContext directly to read from the SQL server (although that depends on your Spark version). Hope this helps, good luck!
The foundation for writing data in Spark is the DataFrameWriter, which is accessed per DataFrame through the dataFrame.write attribute. Save modes specify what happens if Spark finds data already at the destination; there are four typical save modes and the default is errorIfExists.
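As a quick illustration (assuming the url, mode and properties variables defined in the steps below):
# Illustrative only: obtain the writer from the DataFrame and pick a save mode.
writer = df.write.mode(mode)   # "append", "overwrite", "ignore" or "errorifexists"
writer.jdbc(url=url, table="baz", properties=properties)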
Include the applicable JDBC driver when you submit the application or start the shell. You can use, for example, --packages:
bin/pyspark --packages group:name:version
or combine driver-class-path and jars:
bin/pyspark --driver-class-path $PATH_TO_DRIVER_JAR --jars $PATH_TO_DRIVER_JAR
These properties can also be set using the PYSPARK_SUBMIT_ARGS environment variable before the JVM instance has been started, or using conf/spark-defaults.conf to set spark.jars.packages or spark.jars / spark.driver.extraClassPath.
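If you prefer to configure this in code, here is a hedged sketch (the PostgreSQL coordinates and version are only an example; substitute the driver matching your database). Note that packages are resolved when the JVM starts, so this must run before any SparkContext/SparkSession exists:
from pyspark.sql import SparkSession

# Example only: substitute the coordinates of your JDBC driver.
spark = (SparkSession.builder
         .appName("jdbc-example")
         .config("spark.jars.packages", "org.postgresql:postgresql:42.2.5")
         .getOrCreate())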
Choose the desired mode. The Spark JDBC writer supports the following modes:
append: Append contents of this DataFrame to existing data.
overwrite: Overwrite existing data.
ignore: Silently ignore this operation if data already exists.
error (default case): Throw an exception if data already exists.
Upserts or other fine-grained modifications are not supported.
mode = ...
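For example (the choice here is illustrative):
mode = "overwrite"  # or "append", "ignore", "error"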
Prepare JDBC URI, for example:
# You can encode credentials in URI or pass
# separately using properties argument
# of jdbc method or options
url = "jdbc:postgresql://localhost/foobar"
(Optional) Create a dictionary of JDBC arguments.
properties = { "user": "foo", "password": "bar" }
properties / options can also be used to set supported JDBC connection properties.
Use DataFrame.write.jdbc to save the data (see pyspark.sql.DataFrameWriter for details):
df.write.jdbc(url=url, table="baz", mode=mode, properties=properties)
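An equivalent options-based form is sketched below (it assumes a reasonably recent Spark version; see the known issue further down for older releases):
# Options-based writer, equivalent to df.write.jdbc above.
(df.write
   .format("jdbc")
   .option("url", url)
   .option("dbtable", "baz")
   .option("user", "foo")
   .option("password", "bar")
   .option("driver", "org.postgresql.Driver")  # optional if auto-detection works
   .mode(mode)
   .save())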
Known issues:
Suitable driver cannot be found when the driver has been included using --packages (java.sql.SQLException: No suitable driver found for jdbc: ...).
Assuming there is no driver version mismatch, you can solve this by adding the driver class to the properties. For example:
properties = { ... "driver": "org.postgresql.Driver" }
Using df.write.format("jdbc").options(...).save() may result in:
java.lang.RuntimeException: org.apache.spark.sql.execution.datasources.jdbc.DefaultSource does not allow create table as select.
Solution unknown.
In PySpark 1.3 you can try calling the Java method directly:
df._jdf.insertIntoJDBC(url, "baz", True)
Follow steps 1-4 from Writing data
Use sqlContext.read.jdbc:
sqlContext.read.jdbc(url=url, table="baz", properties=properties)
or sqlContext.read.format("jdbc"):
(sqlContext.read.format("jdbc") .options(url=url, dbtable="baz", **properties) .load())
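On Spark 2.x+ the same read can be expressed through the SparkSession (a sketch, assuming a session named spark as created earlier):
# SparkSession-based read, equivalent to the SQLContext variants above.
(spark.read
    .format("jdbc")
    .option("url", url)
    .option("dbtable", "baz")
    .option("user", properties["user"])
    .option("password", properties["password"])
    .load())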
Known issues and gotchas:
Suitable driver cannot be found - see: Writing data
Spark SQL supports predicate pushdown with JDBC sources, although not all predicates can be pushed down. It also doesn't delegate limits or aggregations. A possible workaround is to replace the dbtable / table argument with a valid subquery, as shown below.
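A hedged sketch (the query, column names and alias are illustrative):
# Wrap the work you want done in the database in a subquery;
# most databases require an alias for it.
subquery = "(SELECT id, SUM(amount) AS total FROM baz GROUP BY id) AS tmp"
df = sqlContext.read.jdbc(url=url, table=subquery, properties=properties)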
By default JDBC data sources load data sequentially using a single executor thread. To ensure distributed data loading you can either:
Use a partitioning column (must be IntegerType) together with lowerBound, upperBound and numPartitions, or
provide a list of mutually exclusive predicates, one for each desired partition.
Both approaches are sketched below.
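A hedged sketch of both approaches (the column name, bounds and predicates are illustrative):
# Option 1: range-partitioned read on an integer column.
df = sqlContext.read.jdbc(
    url=url, table="baz",
    column="id", lowerBound=1, upperBound=100000, numPartitions=10,
    properties=properties)

# Option 2: one partition per mutually exclusive predicate.
predicates = ["region = 'EU'", "region = 'US'", "region = 'APAC'"]
df = sqlContext.read.jdbc(url=url, table="baz",
                          predicates=predicates, properties=properties)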
In distributed mode (with a partitioning column or predicates) each executor operates in its own transaction. If the source database is modified at the same time, there is no guarantee that the final view will be consistent.
Suitable drivers can be found in the Maven Repository (to obtain the required coordinates for --packages, select the desired version and copy the data from the Gradle tab in the form compile group:name:version, substituting the respective fields) or in the Maven Central Repository.
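For example, the PostgreSQL driver could be pulled in with (the version is illustrative; check the repository for a current one):
bin/pyspark --packages org.postgresql:postgresql:42.2.5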
Depending on the database, a specialized source might exist and be preferred in some cases.