Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

can't load parquet timestamp using Synapse serverless pool's OPENROWSET

TL;DR

The first possible parquet timestamp, 0001-01-01 00:00:00.000 cannot be loaded with Azure Synapse Serverless SQL Pools, as DATETIME2 or any other type. I'm baffled bc the DATETIME2 docs say that this is within the acceptable range:

Property Value
Date range 0001-01-01 through 9999-12-31 January 1,1 CE through December 31, 9999 CE
Time range 00:00:00 through 23:59:59.9999999

user @RandolphWest found this relevant thread that gets the same error message but they lay the blame on ADF.

Reproducible Example

Here's how to make a file that will reproduce the error. Alternatively, here's a Dropbox link to the file

Make a parquet file

from pyspark.sql.functions import to_timestamp
dfx = (
    spark.createDataFrame(
        data = [ ("1","0001-01-01 00:00:00.000")],
        schema=["id","my_ts"])
    .withColumn("my_ts",to_timestamp("my_ts"))
)
dfx.write.parquet('abfss://<MY-CONTAINER>@<MY-WORKSPACE>.dfs.core.windows.net/dfx')
display(dfx)
+---+-------------------+
| id|              my_ts|
+---+-------------------+
|  1|0001-01-01 00:00:00|
+---+-------------------+

Load a parquet file

SELECT
    TOP 100 *
FROM
    OPENROWSET(
        BULK 'https://<MY-WORKSPACE>.dfs.core.windows.net/<MY-CONTAINER>/dfx/**',
        FORMAT = 'PARQUET'
    ) AS [result]

Error message

Failed to execute query.
Error: Error handling external file: 'Inserting value to batch for column type DATETIME2 failed.
Invalid argument provided.'.
File/External table name: 'https://<MY-WORKSPACE>.dfs.core.windows.net/<MY-CONTAINER>/dfx/part-00015-f98c11bd-ffb4-45de-b505-0aa226183098-c000.snappy.parquet'.
Statement ID: {F3062AA6-DD96-4271-A093-6D24F8E0C2FE} | Query hash: 0x9CCB08D8E9D29322 | Distributed request ID: {D0BA926E-FAB2-47DD-8309-BE91DFAC48F8}.
Total size of data scanned is 1 megabytes, total size of data moved is 0 megabytes, total size of data written is 0 megabytes. 
like image 885
Anders Swanson Avatar asked Jan 26 '26 20:01

Anders Swanson


1 Answers

I had the same error, I've solved this issue saving as datatype string (so do not use to_timestamp) in the adls and then converting to datetime2 into the Synapse query. This sound strange but it actually works. using your code it will be something like this:

SELECT
    cast(my_ts as datetime2) as my_ts
FROM
    OPENROWSET(
        BULK 'https://<MY-WORKSPACE>.dfs.core.windows.net/<MY-CONTAINER>/dfx/**',
        FORMAT = 'PARQUET'
    ) WITH (my_ts as nvarchar(27)) AS [result]
like image 153
Nicola Avatar answered Jan 29 '26 09:01

Nicola



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!