Insert VARIANT type from Pandas into Snowflake

I'm trying to insert data from a Pandas dataframe into a table in Snowflake, and I'm having trouble figuring out how to do it properly. To start with, I have created a table in Snowflake that has some columns of type VARIANT. For example:

CREATE OR REPLACE TABLE
    mydatabase.myschema.results(
        DATE date, 
        PRODUCT string, 
        PRODUCT_DETAILS variant, 
        ANALYSIS_META variant,
        PRICE float
)

Then in Pandas, I have a dataframe like this:

import pandas as pd
record = {'DATE': '2020-11-05',
 'PRODUCT': 'blue_banana',
 'PRODUCT_DETAILS': "{'is_blue': True, 'is_kiwi': nan}",
 'ANALYSIS_META': "None",
 'PRICE': 13.02}
df = pd.DataFrame(record, index=[0])

As you can see, I've encoded the VARIANT columns as strings, because my understanding from the snowflake-connector documentation is that a Snowflake VARIANT type maps to the str dtype in Pandas and vice versa.

So, what I've tried so far is the following:

from snowflake.connector import pandas_tools

pandas_tools.write_pandas(
                conn=conn,
                df=df,
                table_name="results",
                schema="myschema",
                database="mydatabase")

And this does work, returning

(True,
 1,
 1,
 [('czeau/file0.txt', 'LOADED', 1, 1, 1, 0, None, None, None, None)])

However, the results I get in Snowflake are not of the proper VARIANT type. The field ANALYSIS_META is correctly NULL, but the field PRODUCT_DETAILS is stored as a plain string rather than as a nested object.

Also, for example, this query throws an error:

SELECT * FROM
MYDATABASE.MYSCHEMA.RESULTS
WHERE PRODUCT_DETAILS:is_blue -- should work for json/variant fields

So with all that, my question is: how should I properly format my Pandas dataframe in order to insert the VARIANT fields correctly as nested fields into a Snowflake table? I thought that casting a dictionary into a string would do the trick, but apparently it doesn't work as I expected. What am I missing here?

asked Nov 06 '22 04:11 by tania


1 Answer

After some investigation, I found the following solution to work:

1. Ensure that the columns are json-compliant

The key here is that json.dumps will transform your data into valid JSON (double quotes instead of single quotes, true/false instead of True/False, null instead of None, and so on).

import pandas as pd
import json
record = {'DATE': '2020-11-05',
 'PRODUCT': 'blue_banana',
 'PRODUCT_DETAILS': json.dumps({'is_blue': True, 'is_kiwi': None}),
 'ANALYSIS_META': json.dumps(None),
 'PRICE': 13.02}
df = pd.DataFrame(record, index=[0])
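To see why the string from the question was not usable as JSON, it can help to compare Python's str() of a dict with the output of json.dumps. The sketch below is only an illustration: the helper names is_valid_json and to_variant_json are hypothetical, and mapping NaN to null is an assumption about how missing values should be represented.

```python
import json
import math


def is_valid_json(text):
    """Return True if `text` parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def to_variant_json(value):
    """Serialize a dict for a VARIANT column, mapping float NaN values to null.

    Hypothetical helper: only top-level dict values are checked for NaN.
    """
    if value is None:
        return json.dumps(None)
    cleaned = {
        key: (None if isinstance(val, float) and math.isnan(val) else val)
        for key, val in value.items()
    }
    return json.dumps(cleaned)


details = {"is_blue": True, "is_kiwi": float("nan")}

as_repr = str(details)              # Python repr: single quotes, True, nan
as_json = to_variant_json(details)  # JSON: double quotes, true, null

print(is_valid_json(as_repr))  # False
print(is_valid_json(as_json))  # True
print(as_json)                 # {"is_blue": true, "is_kiwi": null}
```

The str() form uses Python literals, which a JSON parser rejects, while the json.dumps form is what parse_json expects to receive.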

2. Ensure you use parse_json and INSERT iteratively

Instead of using write_pandas as tried originally, we can INSERT into the table row by row, making sure to call parse_json on the columns of the desired VARIANT type, while also passing the value as a string literal (by wrapping it in single quotes). The caveat is that this solution issues one statement per row, so it would be very slow if you have large amounts of data.

sql = """INSERT INTO MYDATABASE.MYSCHEMA.RESULTS
SELECT
 to_date('{DATE}'),
 '{PRODUCT}',
 parse_json('{PRODUCT_DETAILS}'),
 parse_json('{ANALYSIS_META}'),
 {PRICE}
"""
### CREATE A SNOWFLAKE CONN...

# Insert one row at a time, substituting each row's values into the SQL template
cur = conn.cursor()
for _, row in df.iterrows():
    cur.execute(sql.format(**row.to_dict()))
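One fragility of the loop above is that str.format pastes values directly into the SQL text, so a JSON value containing a single quote would break the statement. A possible variation, sketched here under the assumption that the connector is using its default pyformat parameter style, is to pass the values as bind parameters instead. The function name insert_rows and the _RecordingCursor stand-in are hypothetical; in real use you would pass conn.cursor() instead of the stand-in.

```python
import json

# Same statement as above, but with named placeholders instead of str.format fields
INSERT_SQL = """INSERT INTO MYDATABASE.MYSCHEMA.RESULTS
SELECT
 to_date(%(DATE)s),
 %(PRODUCT)s,
 parse_json(%(PRODUCT_DETAILS)s),
 parse_json(%(ANALYSIS_META)s),
 %(PRICE)s
"""


def insert_rows(cursor, rows):
    """Execute one parameterized INSERT per row; `rows` is an iterable of dicts."""
    count = 0
    for row in rows:
        cursor.execute(INSERT_SQL, row)
        count += 1
    return count


class _RecordingCursor:
    """Stand-in for a real cursor so the sketch runs without a connection."""

    def __init__(self):
        self.calls = []

    def execute(self, sql, params=None):
        self.calls.append((sql, params))


rows = [{
    "DATE": "2020-11-05",
    "PRODUCT": "blue_banana",
    # The single quote inside the value is the case that breaks str.format
    "PRODUCT_DETAILS": json.dumps({"is_blue": True, "note": "it's blue"}),
    "ANALYSIS_META": json.dumps(None),
    "PRICE": 13.02,
}]

cursor = _RecordingCursor()
inserted = insert_rows(cursor, rows)
print(inserted)  # 1
```

With a dataframe, the rows could come from df.to_dict("records"), assuming the values are plain Python types. This is still one statement per row, so the performance caveat above applies unchanged.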

answered Nov 15 '22 10:11 by tania