
Can std::hash<std::string> return the same value for different strings?

Tags:

c++

c++11

c++17

The page linked below mentions the possibility of collisions, but I am trying to use std::hash to find duplicate entries:

http://www.cplusplus.com/reference/functional/hash/

I am using std::hash<std::string> and storing the returned value in a std::unordered_set. If emplace fails, I mark the string as a duplicate.

Build Succeeded asked Jan 01 '23 10:01


1 Answer

Yes. Hashes are generally functions from a large space of values into a small space of values, e.g. from the space of all strings to 64-bit integers (std::hash returns a std::size_t, which is 64 bits wide on typical 64-bit platforms). There are far more strings than 64-bit integers, so by the pigeonhole principle multiple strings must share the same hash. A good hash function is one for which there is no simple rule relating strings with the same hash value.
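The counting argument is easier to see with a much smaller output space. The sketch below is only an illustration: `find_collision` and the generated "key0", "key1", ... strings are hypothetical, and the hash is deliberately reduced modulo `buckets` so that a collision is guaranteed rather than merely possible.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical helper: shrink std::hash to `buckets` possible values and
// return the first two distinct strings that map to the same reduced value.
std::optional<std::pair<std::string, std::string>>
find_collision(std::size_t buckets) {
    assert(buckets > 0);
    std::unordered_map<std::size_t, std::string> seen;  // reduced hash -> first string
    // buckets + 1 distinct strings into `buckets` slots: two must share a slot.
    for (std::size_t i = 0; i <= buckets; ++i) {
        std::string s = "key" + std::to_string(i);
        std::size_t h = std::hash<std::string>{}(s) % buckets;
        auto [it, inserted] = seen.emplace(h, s);
        if (!inserted) {
            return std::make_pair(it->second, s);  // different strings, same value
        }
    }
    return std::nullopt;  // not reached, by the pigeonhole argument
}
```

With the full std::size_t range the same argument holds; collisions are just far rarer, and which strings collide depends on the standard library implementation.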

So, when we want to use hashes to find duplicate strings (or duplicate anything), it's always a two-phase process (at least):

  1. Look for strings with identical hash (i.e. locate the "hash bucket" for your string)
  2. Do a character-by-character comparison of your string with other strings having the same hash.
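The two phases above can be sketched by hand as follows. This is a minimal illustration, not how any particular standard library implements its containers; the `DuplicateDetector` class and its `seen_before` member are hypothetical names, and the hash function is a constructor parameter only so that collisions can be forced when experimenting.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical class, for illustration only.
class DuplicateDetector {
public:
    using Hash = std::function<std::size_t(const std::string&)>;

    explicit DuplicateDetector(Hash hash = std::hash<std::string>{})
        : hash_(std::move(hash)) {}

    // Returns true if `s` was added before; otherwise remembers it and returns false.
    bool seen_before(const std::string& s) {
        // Phase 1: locate the strings that share this hash value.
        std::vector<std::string>& candidates = by_hash_[hash_(s)];
        // Phase 2: compare the actual contents, since equal hashes
        // do not imply equal strings.
        for (const std::string& c : candidates) {
            if (c == s) return true;
        }
        candidates.push_back(s);
        return false;
    }

private:
    Hash hash_;
    std::unordered_map<std::size_t, std::vector<std::string>> by_hash_;
};
```

Even with a hash function that returns the same value for every string, this detector never reports a false duplicate; it only gets slower, because every lookup falls through to phase 2.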

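std::unordered_set<std::string> does both steps internally, so you need not worry about the specifics. Because it does this for you, hashing the strings yourself and then storing the hash values in a std::unordered_set is redundant. It is also unreliable for duplicate detection: two different strings that happen to share a hash would be reported as duplicates. Store the strings themselves in the set instead.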
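A minimal sketch of duplicate detection that stores the strings themselves, assuming the inputs arrive as a std::vector<std::string>; the function name `find_duplicates` is made up for this example.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: returns every input that repeats an earlier input,
// in the order the repeats occur.
std::vector<std::string> find_duplicates(const std::vector<std::string>& inputs) {
    std::unordered_set<std::string> seen;  // hashes and compares strings internally
    std::vector<std::string> duplicates;
    for (const std::string& s : inputs) {
        // insert() returns a pair; .second is false when an equal string is already stored.
        if (!seen.insert(s).second) {
            duplicates.push_back(s);
        }
    }
    return duplicates;
}
```

Here a failed insert really does mean "equal string already present", because the container falls back to operator== whenever hashes match.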

Finally, note that there are other features one could use for initial duplicate screening, or for searching among the same-hash values. For example, string length: before comparing two strings character by character, you can check their lengths (which you should be able to access without actually iterating over the strings); if the lengths differ, the strings cannot be equal.
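The length pre-check might look like the sketch below. The function name `equal_with_length_check` is hypothetical, and the explicit check is shown only to make the idea visible: equality comparison on std::string is commonly implemented with a size check of this kind already, so in practice plain == is usually enough.

```cpp
#include <algorithm>
#include <cassert>
#include <string_view>

// Hypothetical helper: reject on length first, then compare contents.
bool equal_with_length_check(std::string_view a, std::string_view b) {
    // size() is constant time, so this rejection costs no iteration.
    if (a.size() != b.size()) {
        return false;
    }
    // Same length: fall back to a character-by-character comparison.
    return std::equal(a.begin(), a.end(), b.begin());
}
```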

einpoklum answered Jan 25 '23 23:01