I am using the code below to write a DataFrame of 43 columns and about 2,000,000 rows into a table in SQL Server:
dataFrame
.write
.format("jdbc")
.mode("overwrite")
.option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
.option("url", url)
.option("dbtable", tablename)
.option("user", user)
.option("password", password)
.save()
Sadly, while it works for small DataFrames, it's either extremely slow or times out for large ones. Any hints on how to optimize it?
I've tried setting rewriteBatchedStatements=true
Thanks.
A Pandas DataFrame is mutable, and complex operations are easier to perform on it than on a Spark DataFrame; a Spark DataFrame, however, is immutable and distributed, so processing large amounts of data is much faster.
To write data from a Spark DataFrame into a SQL Server table, we need a SQL Server JDBC connector. Also, we need to provide basic configuration property values like connection string, user name, and password as we did while reading the data from SQL Server.
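As a rough sketch (the server, database, and credential values below are placeholders, not values from the question), the connection properties might be assembled like this:
# Placeholder connection details -- substitute your own server, database, and credentials
# (ideally read from a secret store rather than hard-coded).
server = "myserver.database.windows.net"
database = "mydatabase"
username = "myuser"
password = "mypassword"
table_name = "dbo.my_table"

# Standard SQL Server JDBC URL format.
url = f"jdbc:sqlserver://{server}:1433;databaseName={database}"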
To improve performance using PySpark (administrative restrictions limit us to Python, SQL, and R), you can use the options below.
This method (the plain Spark JDBC connector) reads or writes the data row by row, resulting in performance issues for large DataFrames. Not recommended.
# Write mode can be "overwrite" or "append".
df.write \
.format("jdbc") \
.mode("overwrite") \
.option("url", url) \
.option("dbtable", table_name) \
.option("user", username) \
.option("password", password) \
.save()
This method (the Microsoft Apache Spark connector for SQL Server) uses bulk insert to read/write data. There are many more options that can be explored further.
First install the library on the Databricks cluster using its Maven coordinates, and then use the code below.
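For reference, on a Spark 3.x cluster the Maven coordinate should be something along the lines of com.microsoft.azure:spark-mssql-connector_2.12:1.2.0; check the linked documentation or the connector's GitHub repository for the coordinate that matches your Spark version.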
Recommended for Azure SQL DB or a SQL Server instance.
https://docs.microsoft.com/en-us/sql/connect/spark/connector?view=sql-server-ver15
# Write mode can be "overwrite" or "append"; tune batchsize for your workload.
df.write \
.format("com.microsoft.sqlserver.jdbc.spark") \
.mode("overwrite") \
.option("url", url) \
.option("dbtable", table_name) \
.option("user", username) \
.option("password", password) \
.option("batchsize", as per need) \
.option("mssqlIsolationLevel", "READ_UNCOMMITTED")\
.save()
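The connector's documentation (linked above) also describes additional write options, for example tableLock (bulk insert with a table lock) and reliabilityLevel, which can further affect bulk-load throughput; check the docs for the exact option names and defaults for your connector version.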
This method previously used PolyBase to read and write data to and from Azure Synapse via a staging location (typically Blob Storage or a Data Lake Storage directory), but it now reads and writes data using the COPY statement, which has better performance.
Recommended for Azure Synapse
https://docs.databricks.com/data/data-sources/azure/synapse-analytics.html
df.write \
.format("com.databricks.spark.sqldw") \
.option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>") \
.option("forwardSparkAzureStorageCredentials", "true") \
.option("dbTable", "<your-table-name>") \
.option("tempDir", "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>") \
.save()
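Note that setting forwardSparkAzureStorageCredentials to true tells the connector to forward the storage account access key already configured in the Spark session to Azure Synapse, so the tempDir staging location must be reachable with those credentials; see the linked Databricks documentation for the credential setup details.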
Try adding the batchsize option to your statement with a value of at least 10000 (adjust this value to get better performance) and execute the write again.
From the Spark docs:
The JDBC batch size, which determines how many rows to insert per round trip. This can help performance on JDBC drivers. This option applies only to writing. It defaults to 1000.
It's also worth checking out:
numPartitions option to increase the parallelism (this also determines the maximum number of concurrent JDBC connections)
queryTimeout option to increase the timeout for the write.
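Putting these together, a sketch of the original write with those options added might look like the following (the specific values are illustrative and should be tuned for your data volume, cluster size, and server capacity):
# Illustrative values -- tune batchsize / numPartitions / queryTimeout for your environment.
# batchsize: rows inserted per round trip (Spark's default is 1000)
# numPartitions: maximum number of parallel writers / concurrent JDBC connections
# queryTimeout: seconds a statement may run before timing out (0 means no limit)
df.write \
.format("jdbc") \
.mode("overwrite") \
.option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
.option("url", url) \
.option("dbtable", table_name) \
.option("user", username) \
.option("password", password) \
.option("batchsize", 100000) \
.option("numPartitions", 8) \
.option("queryTimeout", 7200) \
.save()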