I am currently working on a project where some information has to be hashed. As the dataset is huge (millions of records created every day), the algorithm for data transformation has to be fast.
The pieces of data to be hashed have a fixed length: 11 decimal digits (for example, 05018144298). What I would like to know is whether it is worth creating my own hash function instead of using an existing one (for example MD5) in order to significantly decrease the processing time and, if so, what the best way to do it would be. Is it a good idea to modify an existing algorithm (for example MD5, but breaking the input into smaller chunks and adjusting other parameters for a fixed input of 11 decimal digits), or is it better to design a hash function from scratch?
Thank you!
It is not worth doing anything, performance-wise, until you have actually measured that using an existing hash function has a non-negligible impact. A typical MD5 implementation, on a typical PC, will be able to process a few million small messages per second, using a single core of the main CPU. Chances are that your "millions per day" are a piece of cake.
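To make "measure first" concrete, here is a minimal benchmark sketch in Python using the standard hashlib module. The function name hashes_per_second and the generated test inputs are made up for illustration, and the numbers you get will depend on your machine and language runtime, so treat it as a starting point rather than a reference measurement.

```python
import hashlib
import time


def hashes_per_second(algorithm: str, count: int = 200_000) -> float:
    """Hash `count` distinct 11-digit strings and return the measured rate."""
    # Hypothetical test inputs: zero-padded 11-digit decimal strings.
    inputs = [f"{i:011d}".encode("ascii") for i in range(count)]
    start = time.perf_counter()
    for data in inputs:
        hashlib.new(algorithm, data).digest()
    elapsed = time.perf_counter() - start
    return count / elapsed


if __name__ == "__main__":
    for name in ("md5", "sha256"):
        print(f"{name}: {hashes_per_second(name):,.0f} hashes/second")
```

Compare the reported rate with the number of records you actually create per day before deciding that hashing is a bottleneck.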
Designing your own hash function, while keeping the security features of a hash function, is a very bad idea. Right now, the top cryptographers in the world are involved in the design of a new standard hash function, in an open competition organized by NIST. Dozens of very specialized researchers have worked on those for several years, and will keep on doing so for about two more years. A lone programmer, not very specialized in the subject, trying to do better within a few days or weeks, verges on the preposterous. Designing a secure hash function is hard.
The right thing to do, for you, is to use an existing, standard cryptographic hash function. That's not MD5, by the way; serious weaknesses have been uncovered in that function (actually, the first serious weaknesses were uncovered around 1996, and MD5 has been discouraged for the last 15 years). You would be better off using SHA-256.
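As a rough sketch of what using SHA-256 could look like for the 11-digit values from the question, again with Python's standard hashlib module. The helper name hash_record and the input validation are assumptions added for illustration; they are not part of the original question.

```python
import hashlib


def hash_record(value: str) -> str:
    """Return the SHA-256 digest (hex) of an 11-digit decimal string."""
    # Assumption: records are exactly 11 ASCII decimal digits, as in the question.
    if len(value) != 11 or not (value.isascii() and value.isdigit()):
        raise ValueError("expected exactly 11 decimal digits")
    return hashlib.sha256(value.encode("ascii")).hexdigest()


if __name__ == "__main__":
    # Example value taken from the question.
    print(hash_record("05018144298"))
```

The result is a 64-character hexadecimal string; use digest() instead of hexdigest() if you would rather store the raw 32 bytes.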
If you do not need the cryptographic properties of a hash function but just a kind of randomizing function for hashtable-like indexing, then any hash function will be good enough. Just profile it, notice that there is no performance issue whatsoever, and be happy.
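For the hashtable-like case, a minimal sketch could look like the following. CRC32 (from Python's standard zlib module) is only one arbitrary choice of non-cryptographic function, and the names bucket_index and num_buckets are hypothetical; nothing here offers any security property.

```python
import zlib


def bucket_index(value: str, num_buckets: int = 1024) -> int:
    """Map a record to a bucket with a cheap, non-cryptographic checksum."""
    # CRC32 is used purely as a mixing function here, not for security.
    return zlib.crc32(value.encode("ascii")) % num_buckets


if __name__ == "__main__":
    print(bucket_index("05018144298"))
```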