What kind of hashing algorithm is used in the built-in HASH() function?
I'm ideally looking for a SHA-512/SHA-256 hash, similar to what the SHA() function offers in LinkedIn's DataFu UDFs for Pig.
MD5 is considered cryptographically broken and is unsuitable for further use. SHA-1 (Secure Hash Algorithm 1) is a cryptographic hash function designed by the National Security Agency (NSA). It produces a 160-bit (20-byte) hash value, typically rendered as a 40-digit hexadecimal number.
Some common hashing algorithms include MD5, SHA-1, SHA-2, NTLM, and LANMAN. MD5: This is the fifth version of the Message Digest algorithm. MD5 creates 128-bit outputs. MD5 was a very commonly used hashing algorithm.
A hash function is any function that can be used to map data of arbitrary size to fixed-size values. The values returned by a hash function are called hash values, hash codes, digests, or simply hashes. The values are usually used to index a fixed-size table called a hash table.
For comparison, Python uses SipHash for its built-in string hashing: a keyed hash function chosen to resist hash-collision attacks.
The HASH function (as of Hive 0.11) uses an algorithm similar to java.util.List#hashCode.
Its code looks like this:
int hashCode = 0; // Hive HASH uses 0 as the seed, List#hashCode uses 1. I don't know why.
for (Object item : items) {
    hashCode = hashCode * 31 + (item == null ? 0 : item.hashCode());
}
Basically it's a classic hash algorithm as recommended in the book Effective Java. To quote a great man (and a great book):
The value 31 was chosen because it is an odd prime. If it were even and the multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional. A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance: 31 * i == (i << 5) - i. Modern VMs do this sort of optimization automatically.
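To make the difference between the two seeds concrete, here is a minimal sketch of the loop above next to List#hashCode. The class name HiveStyleHash and the sample row are made up for illustration, and this is not Hive's actual source: Hive may compute per-value hashes differently from Java's hashCode(), so the numbers need not match what HASH() returns in a query.

```java
import java.util.Arrays;
import java.util.List;

public class HiveStyleHash {
    // Same loop as quoted above: seed 0, multiplier 31, null counts as 0.
    // Illustrative only; not copied from the Hive code base.
    static int hash(List<?> items) {
        int hashCode = 0;
        for (Object item : items) {
            hashCode = hashCode * 31 + (item == null ? 0 : item.hashCode());
        }
        return hashCode;
    }

    public static void main(String[] args) {
        List<Object> row = Arrays.asList("a", 1, null);

        System.out.println(hash(row));       // seed 0: prints 93248
        System.out.println(row.hashCode());  // seed 1 (List#hashCode): prints 123039

        // The shift-and-subtract identity from the quote holds even on overflow.
        int i = 123456789;
        System.out.println(31 * i == (i << 5) - i); // prints true
    }
}
```

With a seed of 1, the result differs from the seed-0 version by exactly 31 raised to the number of items (here 31^3 = 29791), which is the only effect the seed has.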
I digress. You can look at the HASH source here.
If you want to use SHA-xxx in Hive, you can use the Apache Commons Codec DigestUtils class together with Hive's built-in reflect function (I hope that'll work):
SELECT reflect('org.apache.commons.codec.digest.DigestUtils', 'sha256Hex', 'your_string')
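If you want to check what that call should return, the following sketch computes the same kind of value using only the JDK. It assumes the input string is encoded as UTF-8 and the digest is rendered as lowercase hex; the class name Sha256HexSketch is hypothetical, and this is an approximation of the behaviour rather than the Commons Codec implementation itself.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Sha256HexSketch {
    // SHA-256 of the UTF-8 bytes of the input, as a 64-character lowercase hex string.
    static String sha256Hex(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b)); // two hex digits per byte
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sha256Hex("your_string"));
    }
}
```

Comparing this output with the result of the reflect query for a few sample strings is a quick way to confirm the Hive side is doing what you expect.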