Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Usefulness of KeyEqual in std::unordered_set/std::unordered_map

Tags:

c++

stl

I understand that this may be vague question, but I wonder what are real world cases when custom comparator is useful for hash containers in std. I understand it's usefulness in ordered containers, but for hash containers it seems a bit weird.

Reason for this is that hash value for elements that are equal according to comparator needs to be the same, and I believe that in most cases that actually means converting lookup/insert element to some common representation(it is faster and easier to implement).

For example:

  • set of case insensitive strings: if you want to hash properly you need to uppercase/lowercase the entire string anyway.
  • set of fractions(where 2/3 == 42/63): you need to convert 42/63 to 2/3 and then hash that...

So I wonder if someone can provide some real world examples of usefulness of customizing std::unordered_ template parameters so I can recognize those patterns in future code I write.

Note 1: "symmetry argument" (std::map enables customization of a comparator so std::unordred_should be customizable also) is something I considered and I do not think it is convincing.

Note 2: I mixed 2 kind of comparators (< and ==) in the post for brevity, I know that std::map uses < and std::unordered_map uses ==.

like image 458
NoSenseEtAl Avatar asked Dec 17 '22 16:12

NoSenseEtAl


2 Answers

As per https://en.cppreference.com/w/cpp/container/unordered_set

Internally, the elements are not sorted in any particular order, but organized into buckets. Which bucket an element is placed into depends entirely on the hash of its value. This allows fast access to individual elements, since once a hash is computed, it refers to the exact bucket the element is placed into.

So the hash function defines the bucket your element will end up in, but once the bucket is decided, in order to find the element, the operator == will be used.

Basically operator == is used to resolve hash collision, and hence, you need your hash function and your operator == to be consistent. Furthermore, if your operator operator == says that two elements are equal, the set will not allow a duplication.

For what concerns customization, I think that the idea of case-insensitive set of strings is a good one: given two strings you will need to provide a case-insensitive hash-function to allow the set to determine the bucket it has to store the string in. Then you will need to provide a custom KeyEqualto allow the set to actually retrieve the element.

A case I had to deal with, in the past, was a way to allow users to insert strings, keeping track of their order of insertion but avoiding duplicates. So, given a struct like:

struct keyword{
  std::string value;
  int sequenceCounter;
};

You want to detect duplicates according only to value. One of the solutions I came up with was an unordered_set with a custom comparator/hash function, that used only value. This allowed me to check for the existence of a key before allowing insertion.

like image 121
CuriouslyRecurringThoughts Avatar answered Dec 20 '22 06:12

CuriouslyRecurringThoughts


One interesting usage is to define memory efficient indexes (database sense of the term) for a given set of objects.

Example

Let's say we have a program that has a collection of N objects of this class:

struct Person {
    // each object has a unique firstName/lastName pair
    std::string firstName;
    std::string lastName;

    // each object has a unique ssn value
    std::string socialSecurityNumber;

    // each object has a unique email value
    std::string email;
}

And we need to retrieve efficiently objects by the value of any unique property.

Implementations comparison

Time complexities are given assuming string comparisons are constant time (strings have limited length).

1) Single unordered_map

With a single map indexed by a single key (ex: email):

std::unordered_map<std::string,Person> indexedByEmail;
  • Time complexity: lookup by any unique property other than email requires a traversal of the map: average O(N).
  • Memory usage: the email value is duplicated. This could be avoided by using a single set with custom hash & compare (see 3).

2) Multiple unordered_map, no custom hash & compare

With a map for each unique property, with default hash & comparisons:

std::unordered_map<std::pair<std::string,std::string>, Person> byName;
std::unordered_map<std::string, const Person*> byEmail;
std::unordered_map<std::string, const Person*> bySSN;
  • Time complexity: by using the appropriate map, a lookup by any unique property is average O(1).
  • Memory usage: inefficient, because of all the string duplications.

3) Multiple unordered_set, custom hash & comparison:

With custom hash & comparison, we define different unordered_set which will hash & compare only specific fields of the objects. Theses sets can be used to perform lookup as if items were stored in a map as in 2, but without duplicating any field.

using StrHash = std::hash<std::string>;
// --------------------
struct PersonNameHash {
    std::size_t operator()(const Person& p) const {
        // not the best hashing function in the world, but good enough for demo purposes.
        return StrHash()(p.firstName) + StrHash()(p.lastName);
    }
};
struct PersonNameEqual {
    bool operator()(const Person& p1, const Person& p2) const {
        return (p1.firstName == p2.firstName) && (p1.lastName == p2.lastName);
    }
};
std::unordered_set<Person, PersonNameHash, PersonNameEqual> byName;

// --------------------
struct PersonSsnHash {
    std::size_t operator()(const Person* p) const {
        return StrHash()(p->socialSecurityNumber);
    }
};
struct PersonSsnEqual {
    bool operator()(const Person* p1, const Person* p2) const {
        return p1->socialSecurityNumber == p2->socialSecurityNumber;
    }
};
std::unordered_set<const Person*, PersonSsnHash, PersonSsnEqual> bySSN;

// --------------------
struct PersonEmailHash {
    std::size_t operator()(const Person* p) const {
        return StrHash()(p->email);
    }
};
struct PersonEmailEqual {
    bool operator()(const Person* p1, const Person* p2) const {
        return p1->email == p2->email;
    }
};
std::unordered_set<const Person*,PersonEmailHash,PersonEmailEqual> byEmail;
  • Time complexity: a lookup by any unique property is still O(1) average.
  • Memory usage: much better than 2): no string duplication.

Live demo

like image 44
Vincent Saulue-Laborde Avatar answered Dec 20 '22 06:12

Vincent Saulue-Laborde