I am using the code below to write a DataFrame of 43 columns and about 2,000,000 rows into a table in SQL Server:
dataFrame
.write
.format("jdbc")
.mode("overwrite")
.option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
.option("url", url)
.option("dbtable", tablename)
.option("user", user)
.option("password", password)
.save()
Sadly, while it works for small DataFrames, it's either extremely slow or times out for large ones. Any hints on how to optimize it?
I've tried setting rewriteBatchedStatements=true
Thanks.
A Pandas DataFrame is mutable, and complex operations are easier to perform on it than on a Spark DataFrame; a Spark DataFrame, however, is immutable and distributed, so processing large amounts of data is much faster.
To write data from a Spark DataFrame into a SQL Server table, we need a SQL Server JDBC connector. Also, we need to provide basic configuration property values like connection string, user name, and password as we did while reading the data from SQL Server.
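As a rough sketch (the server, database, and credential values below are placeholders, not values from the question), the connection properties might be assembled like this:
# Placeholder connection details -- substitute your own server, database, and credentials
# (ideally read from a secret store rather than hard-coded).
server = "myserver.database.windows.net"
database = "mydatabase"
username = "myuser"
password = "mypassword"
table_name = "dbo.my_table"

# Standard SQL Server JDBC URL format.
url = f"jdbc:sqlserver://{server}:1433;databaseName={database}"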
To improve performance using PySpark (administrative restrictions limit us to Python, SQL, and R), you can use the options below.
This method (the plain Spark JDBC connector) reads or writes the data row by row, resulting in performance issues for large DataFrames. Not recommended.
# Write mode can be "overwrite" or "append".
df.write \
.format("jdbc") \
.mode("overwrite") \
.option("url", url) \
.option("dbtable", table_name) \
.option("user", username) \
.option("password", password) \
.save()
This method (the Microsoft Apache Spark connector for SQL Server) uses bulk insert to read/write data. There are many more options that can be explored further.
First install the library on the Databricks cluster using its Maven coordinates, and then use the code below.
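For reference, on a Spark 3.x cluster the Maven coordinate should be something along the lines of com.microsoft.azure:spark-mssql-connector_2.12:1.2.0; check the linked documentation or the connector's GitHub repository for the coordinate that matches your Spark version.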
Recommended for Azure SQL DB or a SQL Server instance.
https://docs.microsoft.com/en-us/sql/connect/spark/connector?view=sql-server-ver15
# Write mode can be "overwrite" or "append"; tune batchsize for your workload.
df.write \
.format("com.microsoft.sqlserver.jdbc.spark") \
.mode("overwrite") \
.option("url", url) \
.option("dbtable", table_name) \
.option("user", username) \
.option("password", password) \
.option("batchsize", as per need) \
.option("mssqlIsolationLevel", "READ_UNCOMMITTED")\
.save()
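The connector's documentation (linked above) also describes additional write options, for example tableLock (bulk insert with a table lock) and reliabilityLevel, which can further affect bulk-load throughput; check the docs for the exact option names and defaults for your connector version.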
This method previously used PolyBase to read and write data to and from Azure Synapse via a staging location (typically Blob Storage or a Data Lake Storage directory), but it now reads and writes data using the COPY statement, which has better performance.
Recommended for Azure Synapse
https://docs.databricks.com/data/data-sources/azure/synapse-analytics.html
df.write \
.format("com.databricks.spark.sqldw") \
.option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>") \
.option("forwardSparkAzureStorageCredentials", "true") \
.option("dbTable", "<your-table-name>") \
.option("tempDir", "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>") \
.save()
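Note that setting forwardSparkAzureStorageCredentials to true tells the connector to forward the storage account access key already configured in the Spark session to Azure Synapse, so the tempDir staging location must be reachable with those credentials; see the linked Databricks documentation for the credential setup details.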
Try adding the batchsize option to your statement with a value of at least 10000 (adjust this value to get better performance) and execute the write again.
From the Spark docs:
The JDBC batch size, which determines how many rows to insert per round trip. This can help performance on JDBC drivers. This option applies only to writing. It defaults to 1000.
It's also worth checking out:
numPartitions option to increase the parallelism (this also determines the maximum number of concurrent JDBC connections)
queryTimeout option to increase the timeout for the write.
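Putting these together, a sketch of the original write with those options added might look like the following (the specific values are illustrative and should be tuned for your data volume, cluster size, and server capacity):
# Illustrative values -- tune batchsize / numPartitions / queryTimeout for your environment.
# batchsize: rows inserted per round trip (Spark's default is 1000)
# numPartitions: maximum number of parallel writers / concurrent JDBC connections
# queryTimeout: seconds a statement may run before timing out (0 means no limit)
df.write \
.format("jdbc") \
.mode("overwrite") \
.option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
.option("url", url) \
.option("dbtable", table_name) \
.option("user", username) \
.option("password", password) \
.option("batchsize", 100000) \
.option("numPartitions", 8) \
.option("queryTimeout", 7200) \
.save()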