Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How snowflake internally performs updates?

As far as I know, underlying files (columnar format) is immutable. My question is, if files are immutable, how the updates are being performed. Do Snowflake maintains different versions of the same row, and returns the latest version based on key? or it inserts the data into new files behind the scene and deletes old files? How performance gets affected in these scenarios (querying current data), if time travel is set to 90 days as Snowflake need to maintain different version of the same row. But as Snowflake doesn't respect keys, how even different versions are detected. Any insights (document/video) on the detailed internals is appreciated.

like image 418
Manu Batham Avatar asked Feb 07 '18 00:02

Manu Batham


1 Answers

It's a complex question, but a basic ideas are as follows (quite a bit simplified):

  • records are stored in immutable micro-partitions on S3
  • a table is a list of micro-partitions
  • when a record is modified
    • its old micro-partition is marked as inactive (from that moment),
    • a new micro-partition is created, containing the modified record, but also other records from that micro-partition.
    • the new micro-partition is added to the table's list (marked as active from that moment)
  • inactive micro-partitions are not deleted for some time, allowing time-travel

So Snowflake doesn't need a record key, as each record is stored in only one file active at a given time.

The impact of performing updates on querying is marginal, the only visible impact might be that the files need to be fetched from S3 and cached on the warehouses.

For more info, I'd suggest going to Snowflake forums and asking there.

like image 158
Marcin Zukowski Avatar answered Sep 24 '22 00:09

Marcin Zukowski