When I do unordered_set::find
unordered_set<int> uniqueNum;
//code...
if(uniqueNum.find(num + k) != uniqueNum.end())
//code ...
the runtime of this code is faster than
unordered_set<int> uniqueNum;
//code...
if(find(uniqueNum.begin(), uniqueNum.end(), num + k) != uniqueNum.end())
//code...
According to the reference, unordered_set::find is "Worst case: linear in container size" while find is "Up to linear in the distance between first and last: Compares elements until a match is found".
Are they not the same runtimes? Why is unordered_set::find faster when I run my code? Is std::find doing something behind the hood that I'm missing?
This is due to how they are implemented. std::find
operates as you might expect. Start at the beginning and compare each element until it reaches the end. This is fairly universal, but won't benefit from the specific data structure used. However, unordered_set
is a hashset so if there are no hash collisions, every element takes the same amount of time to find.
The reason it says there is a "worst case of linear in container size" is because if the hash table were to have a length of 1, every entry would be placed at the same position in the table (pseudocode: table[hash(element) % table_length].push(element)
). If this were to happen, then depending on the implementation it could end up looking more like a list in memory and it would have to check every item sequentially. In practice though, this would likely never happen.
An unordered set is like a filing cabinet. Let's say you have files on all the employees in a company. The filing cabinet has 26 drawers, each labeled with a single letter. Each employee's record is stored by the first letter of the surname. The files within a drawer are not further organized.
When unordered_set::find
is told to find an employee's record, it goes directly to the drawer labeled with the first letter of the surname and searches through all of the records in that drawer. When std::find
is given the same task, it starts at the top-left drawer and checks all the records in there, before moving to drawer next to it, and so on, until all the drawers are checked or the record is found. (Note that the top-left drawer is not necessarily "A".)
Let's say that the company has 20 employees. Given a typical distribution of names, unordered_set::find
is likely to go to a drawer with exactly one record in it, which would be the one you are looking for. Maybe it finds two records. Still quick and easy. This represents the common case if your hash function is up to the task. Meanwhile, std::find
might have to look through all of the records to find the one you are looking for. Sometimes it gets lucky and finds it right away. On average, it will look through half the records.
However, the typical case is not the worst case. The worst case is that the company's last recruitment drive was at a family reunion, and as a consequence all 20 employees are named "Jones". The normally fast unordered_set::find
will make a beeline to drawer "J" only to find every employee record in that drawer. It will look through, on average, half of the records before finding the desired one, the same as std::find
.
Should you be concerned with typical or worst-case times? That depends on your particular circumstances. Sometimes there is a systematic reason for falling into the worst case, akin to recruiting at a family reunion. On the other hand, if the names are randomly distributed, the chance in this example of having 10 (or more) records in the same drawer is around 1 in 5×1012; the true worst case is even rarer (involving 1026).... Usually you can count on fast look-ups.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With