I have to implement a MERGE statement in Snowflake. The target table will have more than 6 billion rows, and the comparison involves multiple columns, around 20. I was thinking of generating a hash key over all 20 columns using Snowflake's HASH function.
But I read in the HASH documentation that after about 4 billion rows you are likely to get a duplicate hash key. Is my understanding correct?
So should I avoid a hash key for comparing records and use all the columns instead?
Or can I use MD5 (a 128-bit hex value) or some custom hash function? Kindly suggest.

TL;DR version of this: With your number of rows, using the HASH function gives you about a 62% chance that two rows' hash values collide. Using MD5 instead of HASH reduces the chance of a collision to a tiny fraction of a percent. On the same size warehouse, calculating MD5s takes about 24% longer than calculating hashes. Recommendation: if even very rare collisions are not tolerable, match on either 1) MD5 alone or 2) HASH plus a column compare. Option 2 will be faster, but will require more maintenance if the schema changes over time.
The topic of using hashes in merges merits its own position paper, and I'm sure many have been written. We'll focus on your specifics and how Snowflake responds.
Let's start with the section of the docs you referenced. When distinct inputs produce identical hash values, it's called a collision. It's a classic problem in mathematics and computing known as the Birthday Problem (https://en.wikipedia.org/wiki/Birthday_problem). There's a ton of writing on the subject, so we'll stick to what's relevant to your situation.
If you use a 64-bit hash, which is what Snowflake's HASH function returns, on your 6-billion-row table, the probability of at least one collision is about 62%. Still, it's a manageable problem, as we'll explore later.
If you use a 128-bit hash such as MD5 on 6 billion inputs, the probability rounds to zero. Even if your table grows to 1,000 times as many rows, the probability of a collision would be about 0.0000000000053%.
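Both of those figures come from the standard birthday-problem approximation, p ≈ 1 - exp(-n^2 / (2d)) for n rows hashed into d possible values, and you can sanity-check them right in Snowflake:
select 1 - exp(-power(6e9, 2) / (2 * power(2, 64))) as p_collision_64bit_6b; -- ~0.62
select 1 - exp(-power(6e12, 2) / (2 * power(2, 128))) as p_collision_128bit_6t; -- ~5.3e-14, effectively zero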
While that seems to get around the collision problem, it introduces a new one: the MD5 function is more computationally expensive than the HASH function. We can measure how much with a simple test on a Medium-sized warehouse.
select count(md5(concat(*)))
from "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF10000"."ORDERS"; -- 18m 41s

select count(hash(concat(*)))
from "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF10000"."ORDERS"; -- 15m 6s
I used count to eliminate result-collection time, but the MD5s and hashes are still being calculated. At 18m 41s vs. 15m 6s, MD5 takes about 24% longer to compute than HASH.
This brings us to the final part of the TL;DR discussion: using the HASH function plus column compares. This is the faster option, and the only one that guarantees no false matches from collisions. It's faster because of short-circuit evaluation. Consider this operation in pseudocode:
condition1 AND condition2
In this expression, if the first condition fails there's no need to test the second. I haven't tested this experimentally (yet) in a Snowflake merge match clause, but I see no reason it would evaluate the second part of the expression when the first already rules out a match. That way practically all changed rows are detected quickly by the hash comparison alone, and the column compares only serve to catch the very rare collision.
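Here's a rough sketch of what that match clause could look like. All of the names (target T, staging table S, key column ID, columns COL01 and COL02 standing in for your 20, and a stored ROW_HASH on both sides) are made up for illustration; note also that nullable columns would need equal_null() rather than = to compare reliably.
merge into t
using s
  on t.id = s.id
when matched and not (
      t.row_hash = s.row_hash  -- cheap 64-bit compare fails fast for changed rows
  and t.col01 = s.col01        -- column compares only matter when the hashes
  and t.col02 = s.col02        -- agree, catching the rare collision
  -- ...and so on for all 20 columns
) then update set
      t.col01 = s.col01,
      t.col02 = s.col02,
      t.row_hash = s.row_hash
when not matched then
  insert (id, col01, col02, row_hash)
  values (s.id, s.col01, s.col02, s.row_hash);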
One final thought: the fewer rows you merge each time relative to the size of the table, the less the extra MD5 computation time matters. The MD5 values already sitting in the table are computation you've already "paid" for. If you're merging a few thousand rows, the 24% extra time to calculate their MD5s is inconsequential and saves you from having to maintain a column list in your match clause.
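If you go the MD5 route, the one-time backfill might look something like this (again, made-up names). Two caveats worth a comment: concat() returns NULL if any input is NULL, so nullable columns should be wrapped in coalesce(), and a delimiter between columns keeps inputs like ('ab','c') and ('a','bc') from producing the same digest.
alter table t add column row_md5 char(32);

update t
   set row_md5 = md5(concat(col01, '|', col02)); -- list all 20 columns

-- subsequent merges can then match on the digest alone, e.g.:
--   when matched and t.row_md5 <> md5(concat(s.col01, '|', s.col02)) then update ...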