What would be the best hashing algorithm if we had the following priorities (in that order):

1. Minimal hash collisions
2. Performance

It doesn't have to be secure. Basically, I'm trying to create an index based on a combination of properties of some objects. All the properties are strings. Any references to C# implementations would be appreciated.

Best hashing algorithm in terms of hash collisions and performance for strings

1 Answer

Forget about the term "best". Unless the set of data to be hashed is very limited, any hash algorithm that performs very well on average can become completely useless when it is fed the right (or, from your perspective, "wrong") data.
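A small illustration of that point, sketched in Python for brevity (the idea is the same in C#). The `additive_hash` function is a deliberately weak toy made up for this example, not something the answer recommends: it behaves tolerably on unrelated words, but anagrams are exactly the "wrong" data for it.

```python
def additive_hash(text, buckets=1024):
    """Toy hash (hypothetical): sum of the UTF-8 bytes, reduced to a bucket index."""
    return sum(text.encode("utf-8")) % buckets


# Anagrams contain the same bytes in a different order, so a plain sum
# cannot tell them apart: every word below lands in the same bucket.
anagrams = ["listen", "silent", "enlist", "tinsel", "inlets"]
anagram_buckets = {additive_hash(word) for word in anagrams}

print(len(anagram_buckets))  # number of distinct buckets used by the anagrams
```

Stronger hash functions have their own pathological inputs; they are just harder to stumble upon by accident.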

Instead of spending too much time on how to make the hash more collision-free without using too much CPU time, I'd rather start thinking about how to make collisions less problematic. For example, if every hash bucket is in fact a table, and all strings in that table (the ones that collided) are sorted alphabetically, you can search within a bucket using binary search, which is only O(log n). That means that even when every second hash bucket has 4 collisions, your code will still have decent performance: a bit slower than a collision-free table, but not by much. One big advantage here is that if your table is big enough and your hash is not too simple, two strings that produce the same hash value will usually look completely different. The binary search can then stop comparing strings after maybe one or two characters on average, which makes every comparison very fast.
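The bucket layout described above could be sketched roughly as follows. This is a minimal Python sketch under stated assumptions, not the answer's actual code: the class name `SortedBucketIndex` is hypothetical, Python's built-in `hash` stands in for whatever hash function is chosen, and the standard `bisect` module provides the binary search inside each bucket.

```python
import bisect


class SortedBucketIndex:
    """Hash set of strings whose buckets are kept sorted (hypothetical name)."""

    def __init__(self, bucket_count=64):
        # One sorted list per bucket; colliding keys share a list.
        self.buckets = [[] for _ in range(bucket_count)]

    def _bucket_for(self, key):
        # Built-in hash() is only a stand-in for the chosen hash function.
        return self.buckets[hash(key) % len(self.buckets)]

    def add(self, key):
        bucket = self._bucket_for(key)
        pos = bisect.bisect_left(bucket, key)  # binary search, O(log n)
        if pos == len(bucket) or bucket[pos] != key:
            bucket.insert(pos, key)  # insert in place to keep the bucket sorted

    def __contains__(self, key):
        bucket = self._bucket_for(key)
        pos = bisect.bisect_left(bucket, key)
        return pos < len(bucket) and bucket[pos] == key


# Only two buckets, so collisions are guaranteed for this demo.
index = SortedBucketIndex(bucket_count=2)
for key in ["pear", "apple", "fig", "kiwi", "apple"]:
    index.add(key)

print("fig" in index, "plum" in index)
```

Keeping each bucket sorted makes insertion slightly more expensive, which is usually an acceptable trade when lookups dominate.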

Actually, I once had a situation where searching directly within a sorted table using binary search turned out to be faster than hashing! Even though my hash algorithm was simple, it took quite some time to hash the values. Performance testing showed that hashing only became faster than binary search once the table held more than about 700-800 entries. However, the table could never grow larger than 256 entries anyway, and the average table had fewer than 10 entries, so benchmarking clearly showed that binary search was faster on every system and every CPU I tested. A big advantage here was that comparing just the first byte of the data was usually enough to move on to the next bsearch iteration, because the data tended to differ within the first one or two bytes.
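To make the "first byte is usually enough" effect visible, here is a small Python sketch of a binary search that counts how many byte positions it actually examines. The helper names (`compare_counting`, `bsearch`) and the sample table are made up for illustration; the crossover point between hashing and binary search depends on the data and platform and has to be measured.

```python
def compare_counting(a, b, counter):
    """Three-way compare of two byte strings; counter[0] records bytes examined."""
    for x, y in zip(a, b):
        counter[0] += 1
        if x != y:
            return -1 if x < y else 1
    # One string is a prefix of the other (or they are equal).
    return (len(a) > len(b)) - (len(a) < len(b))


def bsearch(table, key, counter):
    """Return the index of key in the sorted table, or -1 if it is absent."""
    lo, hi = 0, len(table) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        order = compare_counting(key, table[mid], counter)
        if order == 0:
            return mid
        if order < 0:
            hi = mid - 1
        else:
            lo = mid + 1
    return -1


table = sorted([b"apple", b"banana", b"cherry", b"grape",
                b"lemon", b"mango", b"peach"])
counter = [0]
found_at = bsearch(table, b"lemon", counter)

# Two probes mismatch on their first byte; only the final, matching probe
# has to look at all five bytes of the key.
print(found_at, counter[0])  # 4 7
```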

So to summarize: I'd take a decent hash algorithm that doesn't cause too many collisions on average and is rather fast (I'd even accept some more collisions if it's just very fast!), and then optimize my code for the smallest performance penalty once collisions do occur. And they will, unless your hash space is at least as large as your data space and you can map a unique hash value to every possible set of data.
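As one candidate for "decent and rather fast", here is a sketch of 32-bit FNV-1a applied to a combination of string properties, as in the question. FNV-1a is an assumption of this example, since the answer does not name a specific algorithm, and the function names are hypothetical. Each field is followed by a zero byte so that ("ab", "c") and ("a", "bc") do not feed identical byte streams into the hash; this assumes the fields themselves contain no NUL characters.

```python
FNV_OFFSET_BASIS = 0x811C9DC5  # 32-bit FNV offset basis
FNV_PRIME = 0x01000193         # 32-bit FNV prime


def fnv1a_32(data, state=FNV_OFFSET_BASIS):
    """32-bit FNV-1a over a byte string, optionally continuing from a state."""
    for byte in data:
        state ^= byte
        state = (state * FNV_PRIME) & 0xFFFFFFFF  # keep the state at 32 bits
    return state


def hash_properties(fields):
    """Hash several string properties as one key (hypothetical helper)."""
    state = FNV_OFFSET_BASIS
    for field in fields:
        # The trailing zero byte marks the field boundary.
        state = fnv1a_32(field.encode("utf-8") + b"\x00", state)
    return state


key_hash = hash_properties(["Smith", "John", "1980-01-01"])
bucket = key_hash % 1024  # reduce to a bucket index for a 1024-bucket table
```

With 32 bits of hash space and arbitrary strings as input, collisions remain unavoidable, so the lookup still has to compare the actual properties after the hash matches.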

answered Oct 19 '22 02:10

Mecki


Tags:

c#

algorithm

hash

dpan
