What hash algorithms are parallelizable? Optimizing the hashing of large files utilizing on multi-core CPUs

Tags:

I'm interested in optimizing the hashing of some large files (optimizing wall clock time). The I/O has been optimized well enough already and the I/O device (local SSD) is only tapped at about 25% of capacity, while one of the CPU cores is completely maxed-out.

I have more cores available, and in the future will likely have even more cores. So far I've only been able to tap into more cores if I happen to need multiple hashes of the same file, say an MD5 AND a SHA256 at the same time. I can use the same I/O stream to feed two or more hash algorithms, and I get the faster algorithms done for free (as far as wall clock time). As I understand most hash algorithms, each new bit changes the entire result, and it is inherently challenging/impossible to do in parallel.

Are any of the mainstream hash algorithms parallelizable?
Are there any non-mainstream hashes that are parallelizable (and that have at least a sample implementation available)?

As future CPUs will trend toward more cores and a leveling off in clock speed, is there any way to improve the performance of file hashing? (other than liquid nitrogen cooled overclocking?) or is it inherently non-parallelizable?

691

asked Apr 26 '10 21:04

DanO

2 Answers

There is actually a lot of research going on in this area. The US National Institute of Standards and Technology is currently holding a competition to design the next-generation of government-grade hash function. Most of the proposals for that are parallelizable.

One example: http://www.schneier.com/skein1.2.pdf

Wikipedia's description of current status of the contest: http://en.wikipedia.org/wiki/SHA-3

answered Nov 21 '22 11:11

sblom

What kind of SSD do you have ? My C implementation of MD5 runs at 400 MB/s on a single Intel Core2 core (2.4 GHz, not the latest Intel). Do you really have SSD which support a bandwidth of 1.6 GB/s ? I want the same !

Tree hashing can be applied on any hash function. There are a few subtleties and the Skein specification tries to deal with them, integrating some metadata in the function itself (this does not change much things for performance), but the "tree mode" of Skein is not "the" Skein as submitted to SHA-3. Even if Skein is selected as SHA-3, the output of a tree-mode hash would not be the same as the output of "plain Skein".

Hopefully, a standard will be defined at some point, to describe generic tree hashing. Right now there is none. However, some protocols have been defined with support for a custom tree hashing with the Tiger hash function, under the name "TTH" (Tiger Tree Hash) or "THEX" (Tree Hash Exchange Format). The specification for TTH appears to be a bit elusive; I find some references to drafts which have either moved or disappeared for good.

Still, I am a bit dubious about the concept. It is kind of neat, but provides a performance boost only if you can read data faster than what a single core can process, and, given the right function and the right implementation, a single core can hash quite a lot of data per second. A tree hash spread over several cores requires having the data sent to the proper cores, and 1.6 GB/s is not the smallest bandwidth ever.

SHA-256 and SHA-512 are not very fast. Among the SHA-3 candidates, assuming an x86 processor in 64-bit mode, some of them achieve high speed (more than 300 MB/s on my 2.4 GHz Intel Core2 Q6600, with a single core -- that's what I can get out of SHA-1, too), e.g. BMW, SHABAL or Skein. Cryptographically speaking, these designs are a bit too new, but MD5 and SHA-1 are already cryptographically "broken" (quite effectively in the case of MD5, rather theoretically for SHA-1) so any of the round-2 SHA-3 candidates should be fine.

When I put my "seer" cap, I foresee that processors will keep on becoming faster than RAM, to the point that hashing cost will be dwarfed out by memory bandwidth: the CPU will have clock cycles to spare while it waits for the data from the main RAM. At some point, the whole threading model (one big RAM for many cores) will have to be amended.

answered Nov 21 '22 13:11

Thomas Pornin

Related questions
                            
                                How to enable Code Analysis in Visual Studio 2010 Professional?
                            
                                Jump to Documentation for a function/class in Visual Studio
                            
                                What is the purpose of the PermissionSet attribute in the MSDN FileSystemWatcher class example?
                            
                                How do you pass a BitmapImage from a background thread to the UI thread in WPF?
                            
                                Any good debugger for HTML5 Javascript postMessage API? [closed]
                            
                                How do you manage multiple versions of the same software for each customer?
                            
                                JavaScript: Decimal Values
                            
                                What is the advantage of using Java Beans?
                            
                                iPhone app "has active assertions beyond permitted time"
                            
                                Can someone explain how to use FastTags
                            
                                Rails on Windows is so slow (rails -v takes 4 seconds)
                            
                                ISO Standard Street Addresses?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With