
Spark: optimise writing a DataFrame to SQL Server

I am using the code below to write a DataFrame of 43 columns and about 2,000,000 rows into a table in SQL Server:

dataFrame
  .write
  .format("jdbc")
  .mode("overwrite")
  .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
  .option("url", url)
  .option("dbtable", tablename)
  .option("user", user)
  .option("password", password)
  .save()

Sadly, while it does work for small DataFrames, it is either extremely slow or times out for large ones. Any hints on how to optimize it?

I've tried setting rewriteBatchedStatements=true

Thanks.

asked Apr 16 '19 by Dawid


2 Answers

To improve write performance using PySpark (given administrative restrictions to use only Python, SQL, and R), one can use the options below.

Method 1: Using JDBC Connector

This method reads or writes the data row by row, resulting in performance issues. Not Recommended.

# mode: "overwrite" or "append"
df.write \
.format("jdbc") \
.mode("overwrite") \
.option("url", url) \
.option("dbtable", table_name) \
.option("user", username) \
.option("password", password) \
.save()

Method 2: Using Apache Spark connector (SQL Server & Azure SQL)

This method uses bulk insert to read and write data. There are many more options that can be explored further.

First install the library using its Maven coordinates on the Databricks cluster, and then use the code below.

Recommended for Azure SQL Database or a SQL Server instance.

https://docs.microsoft.com/en-us/sql/connect/spark/connector?view=sql-server-ver15
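On a non-Databricks deployment, one common way to attach a connector like this is through the spark.jars.packages setting when the session is created. A minimal sketch follows; the Maven coordinate shown is an assumption, so check the linked documentation for the artifact that matches your Spark and Scala versions.

from pyspark.sql import SparkSession

# NOTE: the coordinate below is an assumed example; verify the version
# that matches your Spark/Scala build before using it.
spark = SparkSession.builder \
    .appName("write-to-sql-server") \
    .config("spark.jars.packages",
            "com.microsoft.azure:spark-mssql-connector_2.12:1.2.0") \
    .getOrCreate()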

# mode: "overwrite" or "append"; set batchsize as per need
df.write \
.format("com.microsoft.sqlserver.jdbc.spark") \
.mode("overwrite") \
.option("url", url) \
.option("dbtable", table_name) \
.option("user", username) \
.option("password", password) \
.option("batchsize", 100000) \
.option("mssqlIsolationLevel", "READ_UNCOMMITTED") \
.save()

Method 3: Using Connector for Azure Dedicated SQL Pool (formerly SQL DW)

This connector previously used PolyBase to read and write data to and from Azure Synapse via a staging location (typically a Blob Storage container or a Data Lake Storage directory), but it now reads and writes using the COPY statement, which offers better performance.

Recommended for Azure Synapse

https://docs.databricks.com/data/data-sources/azure/synapse-analytics.html

df.write \
.format("com.databricks.spark.sqldw") \
.option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>") \
.option("forwardSparkAzureStorageCredentials", "true") \
.option("dbTable", "<your-table-name>") \
.option("tempDir", "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>") \
.save()
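Because forwardSparkAzureStorageCredentials is set to true, the linked Databricks documentation also expects the storage account access key to be available in the session configuration so it can be forwarded to the staging location. A minimal sketch, reusing the placeholder names from the snippet above:

# Assumed session-level setting; placeholder names mirror the snippet above
spark.conf.set(
    "fs.azure.account.key.<your-storage-account-name>.blob.core.windows.net",
    "<your-storage-account-access-key>")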
answered Sep 28 '22 by Techno_Eagle

Try adding the batchsize option to your statement with a value of at least 10000 (tune this value to get better performance) and execute the write again.

From the Spark docs:

The JDBC batch size, which determines how many rows to insert per round trip. This can help performance on JDBC drivers. This option applies only to writing. It defaults to 1000.

It's also worth checking out (a combined sketch follows this list):

  • the numPartitions option, to increase parallelism (this also determines the maximum number of concurrent JDBC connections);

  • the queryTimeout option, to increase the timeout for the write operation.
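Putting these together, a minimal sketch of the original write with batchsize, numPartitions, and queryTimeout added; the numeric values are placeholders to tune, not recommendations.

# batchsize, numPartitions, and queryTimeout values are placeholders to tune
dataFrame.write \
    .format("jdbc") \
    .mode("overwrite") \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .option("url", url) \
    .option("dbtable", tablename) \
    .option("user", user) \
    .option("password", password) \
    .option("batchsize", 100000) \
    .option("numPartitions", 8) \
    .option("queryTimeout", 7200) \
    .save()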

answered Sep 28 '22 by notNull