 

Optimal way to store data from Pandas to Snowflake

The DataFrame is huge (7-8 million rows). I tried to_sql with chunksize=5000, but it never finished.

I'm using:

from sqlalchemy import create_engine
from snowflake.sqlalchemy import URL

# engine built via snowflake.sqlalchemy's URL helper (connection details are placeholders)
engine = create_engine(URL(account='...', user='...', password='...', database='...', schema='...', warehouse='...'))

df.to_sql(snowflake_table, engine, if_exists='replace', index=False, index_label=None, chunksize=20000)

What other, faster options are there for loading data into Snowflake from a Pandas DataFrame? Or what am I doing wrong here? The DataFrame is usually 7-10 million rows.

asked Aug 08 '18 by abdthe1

2 Answers

The least painful way I can imagine is to dump the data to S3 as CSV files and have Snowpipe load them into Snowflake automatically. With that set up, you don't have to execute any COPY command or make any Snowflake calls at all.

Refer to the Snowflake documentation for details on how to set up Snowpipe for S3. In short, you need to create a stage, a target table, a file format (I guess you already have these in place), and a pipe. Then set up SQS notifications for your bucket that the pipe will listen to.
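For reference, here is a minimal sketch of that DDL run through the Snowflake Python connector; the object names, bucket path, and credentials below are placeholders rather than anything from the original answer:

import snowflake.connector

# connection details are placeholders
conn = snowflake.connector.connect(
    account='my_account', user='my_user', password='my_password',
    warehouse='my_wh', database='my_db', schema='my_schema')
cur = conn.cursor()

# CSV file format matching the files written to S3 (to_csv includes a header row)
cur.execute("CREATE OR REPLACE FILE FORMAT my_csv_format TYPE = CSV SKIP_HEADER = 1")

# external stage pointing at the S3 prefix the dataframe chunks are dumped to
cur.execute("""
    CREATE OR REPLACE STAGE my_s3_stage
    URL = 's3://mybucket/test/'
    CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...')
    FILE_FORMAT = my_csv_format
""")

# pipe that auto-ingests new files from the stage into the target table
cur.execute("""
    CREATE OR REPLACE PIPE my_pipe AUTO_INGEST = TRUE AS
    COPY INTO my_table FROM @my_s3_stage
    FILE_FORMAT = (FORMAT_NAME = 'my_csv_format')
""")

# SHOW PIPES lists the notification_channel (an SQS ARN) to point the
# bucket's S3 event notifications at
cur.execute("SHOW PIPES")

The notification_channel value returned by SHOW PIPES is the SQS queue your bucket's event notifications should target.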

Snowflake suggests file sizes of around 10-100 MB, so it is likely a good idea to split the dataframe into several files.

import numpy as np
import s3fs

# set up credentials (s3fs is built on boto, hence this is AWS-specific)
fs = s3fs.S3FileSystem(key=key, secret=secret)

# number of files to split the dataframe into
n_chunks = 2

# loop over the dataframe and dump it chunk by chunk to S3
# (you likely want to expand the file-naming logic to avoid overwriting existing files)
for f_name, chunks in enumerate(np.array_split(np.arange(df.shape[0]), n_chunks)):
    bytes_to_write = df.iloc[chunks].to_csv(index=False).encode()
    with fs.open('s3://mybucket/test/dummy_{}.csv'.format(f_name), 'wb') as f:
        f.write(bytes_to_write)

For reference, I tried this with a 7M-row dataframe split into 5 files of around 40 MB each. It took around 3 minutes and 40 seconds from the start of splitting the dataframe until all rows had arrived in Snowflake.

answered Sep 23 '22 by Karl Anka


The optimal way, as ilja-everila pointed out, is “copy into...”. Since Snowflake requires the CSV to be staged in cloud storage before loading, I was hesitant to do it, but it seems that is the only option, given that the performance is 5-10 minutes for 6.5 million records.
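For illustration, a rough sketch of that route using the Snowflake Python connector, staging the CSV with PUT and then running COPY INTO; the connection details, the local file path, and the table name snowflake_table are placeholders, and df is the dataframe from the question:

import snowflake.connector

# connection details are placeholders
conn = snowflake.connector.connect(
    account='my_account', user='my_user', password='my_password',
    warehouse='my_wh', database='my_db', schema='my_schema')
cur = conn.cursor()

# dump the dataframe to a local CSV first
df.to_csv('/tmp/snowflake_table.csv', index=False)

# PUT uploads (and gzip-compresses) the file into the table's internal stage
cur.execute("PUT file:///tmp/snowflake_table.csv @%snowflake_table")

# COPY INTO bulk-loads the staged file into the table
cur.execute("""
    COPY INTO snowflake_table
    FROM @%snowflake_table
    FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
""")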

answered Sep 22 '22 by abdthe1