Azure Databricks to Azure SQL DW: Long text columns

I would like to populate an Azure SQL DW from an Azure Databricks notebook environment. I am using the built-in connector with pyspark:

sdf.write \
  .format("com.databricks.spark.sqldw") \
  .option("forwardSparkAzureStorageCredentials", "true") \
  .option("dbTable", "test_table") \
  .option("url", url) \
  .option("tempDir", temp_dir) \
  .save()

This works fine, but it fails when I include a string column with sufficiently long content. I get the following error:

Py4JJavaError: An error occurred while calling o1252.save. : com.databricks.spark.sqldw.SqlDWSideException: SQL DW failed to execute the JDBC query produced by the connector.

Underlying SQLException(s): - com.microsoft.sqlserver.jdbc.SQLServerException: HdfsBridge::recordReaderFillBuffer - Unexpected error encountered filling record reader buffer: HadoopSqlException: String or binary data would be truncated. [ErrorCode = 107090] [SQLState = S0001]

As I understand it, this is because the default string type is NVARCHAR(256). The length is configurable (reference), but the maximum NVARCHAR length is 4k characters. My strings occasionally reach 10k characters. Therefore, I am curious how I can export certain columns as text/longtext instead.

I would guess that the following would work, if only the preActions were executed after the table was created. They are not, and therefore it fails.

sdf.write \
  .format("com.databricks.spark.sqldw") \
  .option("forwardSparkAzureStorageCredentials", "true") \
  .option("dbTable", "test_table") \
  .option("url", url) \
  .option("tempDir", temp_dir) \
  .option("preActions", "ALTER TABLE test_table ALTER COLUMN value NVARCHAR(MAX);") \
  .save()

Also, postActions are executed after the data is inserted, so that approach will also fail.

Any ideas?

asked Oct 15 '22 by casparjespersen

1 Answer

I had a similar problem and was able to resolve it using the option:

.option("maxStrLength",4000)

Thus in your example this would be:

sdf.write \
  .format("com.databricks.spark.sqldw") \
  .option("forwardSparkAzureStorageCredentials", "true") \
  .option("dbTable", "test_table") \
  .option("maxStrLength",4000)\
  .option("url", url) \
  .option("tempDir", temp_dir) \
  .save()

This is documented in the connector documentation:

"StringType in Spark is mapped to the NVARCHAR(maxStrLength) type in Azure Synapse. You can use maxStrLength to set the string length for all NVARCHAR(maxStrLength) type columns that are in the table with name dbTable in Azure Synapse."

If your strings go over 4k, then you should:

Pre-define your table column with NVARCHAR(MAX) and then write to the table in append mode. In this case you can't use the default clustered columnstore index, so either use a HEAP or set proper indexes. A lazy heap would be:

CREATE TABLE example.[table]
(
    NormalColumn NVARCHAR(256),
    LongColumn NVARCHAR(4000),
    VeryLongColumn NVARCHAR(MAX)
)
WITH (HEAP)

Then you can write to it as usual in append mode, without the maxStrLength option (see the sketch below). This also means you don't over-specify all the other string columns.
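A minimal sketch of that append-mode write, assuming the same sdf, url and temp_dir variables from the question and the example.[table] pre-created above:

# Sketch only: append mode keeps the pre-created table, so the
# NVARCHAR(MAX) column definition is preserved.
sdf.write \
  .format("com.databricks.spark.sqldw") \
  .option("forwardSparkAzureStorageCredentials", "true") \
  .option("dbTable", "example.[table]") \
  .option("url", url) \
  .option("tempDir", temp_dir) \
  .mode("append") \
  .save()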

Other options are to:

  1. use split to convert one long column into several string columns (see the sketch after this list), or
  2. save the data as Parquet and then load it from inside Synapse.
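A rough sketch of option 1, assuming the long column is named value (taken from the question's preActions example) and that 4,000-character chunks are acceptable:

from pyspark.sql import functions as F

# Sketch only: split the long "value" column (column name assumed from the
# question) into fixed-size chunks that each fit within NVARCHAR(4000).
# Three chunks cover strings up to 12,000 characters; adjust as needed.
chunk_size = 4000
num_chunks = 3
sdf_split = sdf.select(
    "*",
    *[F.substring("value", i * chunk_size + 1, chunk_size).alias(f"value_part_{i + 1}")
      for i in range(num_chunks)]
).drop("value")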
answered Oct 21 '22 by jabberwocky