Merge statement in Snowflake seems to be writing too many rows. Is there a way to improve this?

In Snowflake, I am doing a basic merge statement to update a set of rows in a table. The table has 1B rows and is 160GB. The table is clustered using a TenantId column as the clustering key. This column has 10k different values with fairly even distribution.

The data I am merging in are just updates, and include 1M records targeting a subset of those tenant IDs (~500). The merge joins this source to the target based on TenantId (the cluster key of the target) and a recordID.
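For concreteness, the merge is shaped roughly like the following (table and column names here are placeholders standing in for the actual schema):

    MERGE INTO my_big_table AS t
    USING updates_stage AS s
        ON  t.TenantId = s.TenantId   -- clustering key of the target
        AND t.RecordId = s.RecordId
    WHEN MATCHED THEN UPDATE
        SET t.some_value = s.some_value;  -- some_value is illustrative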

The merge correctly reports the number of rows that were updated, but it takes longer than I would expect. If I look at the query execution details, I see that the Merge operator in the plan (which takes up almost all the time compared to the table scans / joins) has "Bytes scanned" and "Bytes written" both equal to the 160GB size of my table.

The bytes written figure is what concerns me. Is there a way to get the merge to focus its writes on just the micro-partitions relevant to the records being touched? It doesn't seem like it should need to rewrite the full size of the table.
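(If it helps anyone reproduce this: the same byte counts shown in the profile can be pulled from the ACCOUNT_USAGE.QUERY_HISTORY view, assuming access to the shared SNOWFLAKE database; the table name in the filter below is a placeholder:)

    SELECT query_id,
           bytes_scanned,
           bytes_written,
           partitions_scanned,
           partitions_total
    FROM   snowflake.account_usage.query_history
    WHERE  query_text ILIKE 'merge into my_big_table%'
    ORDER  BY start_time DESC
    LIMIT  10;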

Cluster depth for the table: 1.0208

Cluster information for the table:

    {
      "cluster_by_keys" : "LINEAR(TENANTID)",
      "total_partition_count" : 29827,
      "total_constant_partition_count" : 29646,
      "average_overlaps" : 0.0323,
      "average_depth" : 1.0208,
      "partition_depth_histogram" : {
        "00000" : 0,
        "00001" : 29643,
        "00002" : 19,
        "00003" : 49,
        "00004" : 55,
        "00005" : 17,
        "00006" : 9,
        "00007" : 25,
        "00008" : 5,
        "00009" : 5,
        "00010" : 0,
        "00011" : 0,
        "00012" : 0,
        "00013" : 0,
        "00014" : 0,
        "00015" : 0,
        "00016" : 0
      }
    }
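(The JSON above is the kind of output returned by Snowflake's SYSTEM$CLUSTERING_INFORMATION function; the table name is a placeholder:)

    SELECT SYSTEM$CLUSTERING_INFORMATION('my_big_table');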

asked Oct 15 '21 by Matthew Bell


1 Answer

To understand what is going on here, you have to understand how Snowflake stores data underneath: in micro-partitions.

Snowflake tables appear mutable (they allow updates), but underneath a table is made up of immutable files: micro-partitions. When you update an existing record, the micro-partition files holding that record are not modified in place. Instead, new micro-partitions are written containing the updated rows (along with the untouched rows that happened to live in the same files), and these become the table's active micro-partitions. The old files are not discarded; they are retained for Time Travel, preserving each record in its previous state.
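You can observe this retention yourself with Time Travel. A minimal sketch, with illustrative table, column, and query-id values:

    -- Updating one row rewrites the whole micro-partition(s) containing it.
    UPDATE my_big_table
    SET    some_value = 'new'
    WHERE  TenantId = 42 AND RecordId = 1001;

    -- The pre-update state is still readable, because the old
    -- micro-partition was retained for Time Travel, not modified.
    SELECT some_value
    FROM   my_big_table
           BEFORE (STATEMENT => '<query-id-of-the-update>');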

This is why insert-only modelling and architecture paradigms are so much more efficient than those that rely on updates. Updates are expensive operations even in traditional RDBMSs, and on big-data platforms built on immutable storage an in-place update is pretty much impossible.

Yes, Snowflake supports updates, but it is up to you to use the platform efficiently, and that includes how you model your data on the platform.
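As a sketch of what insert-only modelling can look like in practice (a generic pattern with placeholder names, not the only way to do it): append new versions of each record rather than updating in place, and resolve the current version at read time.

    -- Append new record versions instead of merging updates in.
    INSERT INTO my_table_history
    SELECT TenantId, RecordId, some_value, CURRENT_TIMESTAMP() AS loaded_at
    FROM   updates_stage;

    -- Expose only the latest version of each record.
    CREATE OR REPLACE VIEW my_table_current AS
    SELECT *
    FROM   my_table_history
    QUALIFY ROW_NUMBER() OVER (
                PARTITION BY TenantId, RecordId
                ORDER BY loaded_at DESC) = 1;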

answered Oct 26 '22 by patrick_at_snowflake