
What kind of hash algorithm is used for Hive's built-in HASH() function?

What kind of hashing algorithm is used in the built-in HASH() function?

Ideally I'm looking for a SHA-512/SHA-256 hash, similar to what the SHA() function offers in the LinkedIn DataFu UDFs for Pig.

user1152532 asked Jan 17 '14 02:01



1 Answer

The HASH function (as of Hive 0.11) uses an algorithm similar to java.util.List#hashCode.

Its code looks like this:

int hashCode = 0; // Hive HASH uses 0 as the seed, List#hashCode uses 1. I don't know why.
for (Object item: items) {
   hashCode = hashCode * 31 + (item == null ? 0 : item.hashCode());
}
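To make the difference in seed concrete, here is a minimal runnable sketch of the loop above next to List#hashCode. The class and method names (HiveHashSketch, hiveStyleHash) are made up for illustration, and the sketch relies on Java's own hashCode for each item, which is an assumption: Hive may hash individual column values differently, so the numbers need not match what HASH() returns in a query.

```java
import java.util.Arrays;
import java.util.List;

public class HiveHashSketch {

    // Same loop as above: seed 0, multiplier 31, null contributes 0.
    // Assumption: each item's Java hashCode() stands in for Hive's per-value hash.
    static int hiveStyleHash(List<?> items) {
        int hashCode = 0;
        for (Object item : items) {
            hashCode = hashCode * 31 + (item == null ? 0 : item.hashCode());
        }
        return hashCode;
    }

    public static void main(String[] args) {
        List<Object> row = Arrays.asList("a", 1, null);
        // The two results differ only because of the seed (0 versus 1).
        System.out.println("seed 0 (HASH-style): " + hiveStyleHash(row));
        System.out.println("seed 1 (List#hashCode): " + row.hashCode());
    }
}
```

For a list of n items the two values differ by exactly 31^n (modulo int overflow), since the seed is multiplied by 31 once per item.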

Basically it's a classic hash algorithm as recommended in the book Effective Java. To quote a great man (and a great book):

The value 31 was chosen because it is an odd prime. If it were even and the multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional. A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance: 31 * i == (i << 5) - i. Modern VMs do this sort of optimization automatically.
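The shift-and-subtract identity in the quote is easy to check for yourself. A small sketch (the class name ShiftIdentitySketch is hypothetical); the identity holds even when the multiplication overflows, because Java int arithmetic wraps modulo 2^32 on both sides:

```java
public class ShiftIdentitySketch {

    // True when 31 * i equals (i << 5) - i under Java's wrapping int arithmetic.
    static boolean identityHolds(int i) {
        return 31 * i == (i << 5) - i;
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 12345, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int i : samples) {
            System.out.println(i + " -> " + identityHolds(i));
        }
    }
}
```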

I digress. You can look at the HASH source here.

If you want a SHA-xxx hash in Hive, you can call the Apache Commons Codec DigestUtils class through Hive's built-in reflect function (I hope that'll work):

SELECT reflect('org.apache.commons.codec.digest.DigestUtils', 'sha256Hex', 'your_string')
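If you want to check the result outside Hive, the following sketch computes a lowercase hex SHA-256 digest with only the JDK's MessageDigest. The class name Sha256HexSketch is made up, and the sketch assumes the input is encoded as UTF-8 before hashing, which is what I'd expect DigestUtils.sha256Hex to do for a String argument; treat it as a cross-check, not as the library's implementation.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Sha256HexSketch {

    // Returns the SHA-256 digest of the input's UTF-8 bytes as 64 lowercase hex characters.
    static String sha256Hex(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask to 0..255 so negative bytes still print as two hex digits.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sha256Hex("your_string"));
    }
}
```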
Nigel Tufnel answered Sep 20 '22 09:09