How do I get a short hash of a long string using Excel VBA?
What's given
What I have done so far
I thought this SO answer is a good start since it generates a 4-digit Hex-Code (CRC16).
But 4 digits were too few. In my test with 400 strings, 20% got a duplicate somewhere else.
The chance to generate a collision is too high.
Sub tester()
For i = 2 To 433
Cells(i, 2) = CRC16(Cells(i, 1))
Next i
End Sub
Function CRC16(txt As String)
Dim x As Long
Dim mask, i, j, nC, Crc As Integer
Dim c As String
Crc = &HFFFF
For nC = 1 To Len(txt)
j = Val("&H" + Mid(txt, nC, 2))
Crc = Crc Xor j
For j = 1 To 8
mask = 0
If Crc / 2 <> Int(Crc / 2) Then mask = &HA001
Crc = Int(Crc / 2) And &H7FFF: Crc = Crc Xor mask
Next j
Next nC
CRC16 = Hex$(Crc)
End Function
How to reproduce
You can copy these 400 test strings from pastebin.
Paste them to column A in a new Excel workbook and execute the code above.
Q: How do I get a string hash which is short enough (12 chars) and long enough to get a small percentage of duplicates.
In order to create a unique hash from a specific string, it can be implemented using their own string to hash converting function. It will return the hash equivalent of a string. Also, a library named Crypto can be used to generate various types of hashes like SHA1, MD5, SHA256 and many more.
A hash is usually a hexadecimal string of several characters. Hashing is also a unidirectional process so you can never work backwards to get back the original data. A good hash algorithm should be complex enough such that it does not produce the same hash value from two different inputs.
Maybe others will find this useful.
I have collected some different functions to generate a short hash of a string in VBA.
I don't take credit for the code and all sources are referenced.
=CRC16HASH(A1)
with this Code
=CRC16NUMERIC(A1)
with this Code
=CRC16TWICE(A1)
with this Code
=SHA1TRUNC(A1)
with this Code
=BASE64SHA1(A1)
with this Code
Here is my test workbook with all example functions and a big number of test strings.
Feel free to add own functions.
Split your string into three shorter strings (if not divisible by three, the last one will be longer than the other two). Run your "short" algorithm on each, and concatenate the results.
I could write the code but based on the quality of the question I think you can take it from here!
EDIT: It turns out that that advice is not enough. There is a serious flaw in your original CRC16 code - namely the line that says:
j = Val("&H" + Mid(txt, nC, 2))
This only handles text that can be interpreted as hex values: lowercase and uppercase letters are the same, and anything after F in the alphabet is ignored (as far as I can tell). That anything good comes out at all is a miracle. If you replace the line with
j = asc(mid(txt, nC, 1))
Things work better - every ASCII code at least starts out life as its own value.
Combining this change with the proposal I made earlier, you get the following code:
Function hash12(s As String)
' create a 12 character hash from string s
Dim l As Integer, l3 As Integer
Dim s1 As String, s2 As String, s3 As String
l = Len(s)
l3 = Int(l / 3)
s1 = Mid(s, 1, l3) ' first part
s2 = Mid(s, l3 + 1, l3) ' middle part
s3 = Mid(s, 2 * l3 + 1) ' the rest of the string...
hash12 = hash4(s1) + hash4(s2) + hash4(s3)
End Function
Function hash4(txt)
' copied from the example
Dim x As Long
Dim mask, i, j, nC, crc As Integer
Dim c As String
crc = &HFFFF
For nC = 1 To Len(txt)
j = Asc(Mid(txt, nC)) ' <<<<<<< new line of code - makes all the difference
' instead of j = Val("&H" + Mid(txt, nC, 2))
crc = crc Xor j
For j = 1 To 8
mask = 0
If crc / 2 <> Int(crc / 2) Then mask = &HA001
crc = Int(crc / 2) And &H7FFF: crc = crc Xor mask
Next j
Next nC
c = Hex$(crc)
' <<<<< new section: make sure returned string is always 4 characters long >>>>>
' pad to always have length 4:
While Len(c) < 4
c = "0" & c
Wend
hash4 = c
End Function
You can place this code in your spreadsheet as =hash12("A2")
etc. For fun, you can also use the "new, improved" hash4 algorithm, and see how they compare. I created a pivot table to count collisions - there were none for the hash12
algorithm, and only 3 for the hash4
. I'm sure you can figure out how to create hash8
, ... from this. The "no need to be unique" from your question suggests that maybe the "improved" hash4
is all you need.
In principle, a four character hex should have 64k unique values - so the chance of two random strings having the same hash would be 1 in 64k. When you have 400 strings, there are 400 x 399 / 2 "possible collision pairs" ~ 80k opportunities (assuming you had highly random strings). Observing three collisions in the sample dataset is therefore not an unreasonable score. As your number of strings N goes up, the probability of collisions goes as the square of N. With the extra 32 bits of information in the hash12, you expect to see collisions when N > 20 M or so (handwaving, in-my-head-math).
You can make the hash12 code a little bit more compact, obviously - and it should be easy to see how to extend it to any length.
Oh - and one last thing. If you have RC addressing enabled, using =CRC16("string")
as a spreadsheet formula gives a hard-to-track #REF
error... which is why I renamed it hash4
32 bits hash function for strings with a low level of collision:
Public Function StrHash(text As String) As Long
Dim i As Long
StrHash = &H65D5BAAA
For i = 1 To Len(text)
StrHash = ((StrHash + AscW(Mid$(text, i, 1))) Mod 69208103) * 31&
Next
End Function
Or as a 64 bits hash function:
Public Function StrHash64(text As String) As String
Dim i&, h1&, h2&, c&
h1 = &H65D5BAAA
h2 = &H2454A5ED
For i = 1 To Len(text)
c = AscW(Mid$(text, i, 1))
h1 = ((h1 + c) Mod 69208103) * 31&
h2 = ((h2 + c) Mod 65009701) * 33&
Next
StrHash64 = Right("00000000" & Hex(h1), 8) & Right("00000000" & Hex(h2), 8)
End Function
Based on the FNV hash algorithm
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With