Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fastest method for fingerprinting an array (calculating a unique hash from an array of data)

I am using a lot of caching and buffering of API calls in my WWW Framework and one of the things that I end up using all around is 'fingerprinting' data in order to match cache filenames as well as detecting API calls that have already been made.

A lot of data is moved in arrays, like GET, POST and so on. As a result the uniqueness of an API call depends on the data.

As a result I need to fingerprint this information. To do it requires generating a 'fingerprint' from the data array as well and hashing it into a string that I can store and compare against.

For array serialization there is serialize() and json_encode() in PHP. After various benchmarks I consider json_encode() the faster method for serializing an array and am quite happy with it.

For hashing there is md5() and sha1() functions, of which md5() is faster according to my benchmarks.

So my current fingerprint algorithm is:

$fingerprint=md5(json_encode($array));

But I am having doubts whether this is the 'fastest possible' method for fingerprinting an array in PHP. I have tried Google and StackOverflow and have not found good alternatives though. Am I on a right track or do I need to do something different?

like image 928
kingmaple Avatar asked Mar 21 '12 21:03

kingmaple


1 Answers

Once you've got your array json_encoded, you should probably go with a non-cyrptographic hash function if you're primarily concerned with speed. Different hash functions are good for different things. MD5 and Sha1 are called cryptographic because they are hard to reverse (note they are widely considered deprecated for security purposes due to vulnerabilities). CRC (cyclic redundancy check) functions are error detecting codes and would be ill-suited for uniqueness anyways.

Wikipedia is a decent place to start for this, if only because contributions there generally have external links to library implementations: List of hash functions. I would recommend reading up on a few of the non-cryptographic libraries there and benchmarking them. The non-cryptographic functions are more written for speed and reasonable degrees of uniqueness, sacrificing security, error detection and other interesting properties, which from your description is exactly what you want.

One final note to consider if you're mainly concerned about speed is how you are going to store and compare the fingerprints themselves. MD5 outputs 128 bits of data, which won't fit into a numeric type in php without some extra library calls and overhead. For my money, I would bet that you could get the best speed of comparison and storage would come from an hash function that can output 64 bit numbers directly. Note that to get 64 numbers natively in php, you need to have 64 bit hardware and have php configured/installed in 64 bit mode. I have some code around here somewhere I used to test our staging and prod environments I could probably dig up if you're interested.

Btw, I don't think you're going to get any faster stringification of an array than json-encode. The heart of that problem is array walking and string manipulation, so essentially the speed is proportional to the verbosity of the output. JSON-encode is very terse compared to php's serialize or export functions. I bet if you looked through enough comments on the php documentation pages, you could find someone who wrote a hash function that takes an array as an input directly, but it would be a gamble whether it was any good at all.

Feel free to ask questions if I was unclear on anything.

like image 133
Patrick M Avatar answered Sep 22 '22 06:09

Patrick M