One of consequences of the IEEE 754 standard is the non-intuitive behavior of std::unordered_set<double>
, when not-a-number elements (NAN
s) are inserted.
Due to the fact that NAN!=NAN
, after the following sequence:
#include <iostream>
#include <cmath>
#include <unordered_set>
int main(){
std::unordered_set<double> set;
set.insert(NAN);
set.insert(NAN);
std::cout<<"Number of elements "<<set.size()<<"\n"; //there are 2 elements!
}
there are two elements in the set
(see it live): NAN
and NAN
!
Mine main issue with this is, that when N
NAN
s are inserted into the hash-set, they all hit the same hash-bucket and the performance of N
inserts into the hash-set degenerates to the worst-case running time - O(N^2)
.
For an example, see the listing at the end of the question or here live - inserting NAN
takes some order of magnitude more time than a "normal" floating number.
My question: is it possible (and if yes - how) to tweak std::unordered_set<double>
in such a way, that there is at most one NAN
-element in the set, no matter the flavor of inserted NAN
s (NAN
, -NAN
and so on)?
Listing:
#include <iostream>
#include <cmath>
#include <unordered_set>
#include <chrono>
constexpr int N=5000;
void test_insert(double value)
{
std::unordered_set<double> s;
auto begin = std::chrono::high_resolution_clock::now();
for (int i = 0; i < N; i++) {
s.insert(value);
}
auto end = std::chrono::high_resolution_clock::now();
std::cout << "Duration: " << (std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count() / 1e9) << "\n";
std::cout << "Number of elements: "<<s.size()<<"\n";
}
int main(){
std::cout << "Not NAN\n";
test_insert(1.0); //takes 0.0001 s
std::cout << "NAN\n";
test_insert(NAN); //takes 0.2 s
}
You can define your own predicate to compare the keys, and ensure NaNs compare equal for this purpose. This can be supplied as the third parameter to the std::unordered_set
class template.
See the definition of EqualPred
below (code copied from question and modified), and its use in declaring the unordered_set
variable. I took the default for the second parameter from the documentation at https://en.cppreference.com/w/cpp/container/unordered_set
Live demo: http://coliru.stacked-crooked.com/a/7085936431e6698f
#include <iostream>
#include <cmath>
#include <unordered_set>
#include <chrono>
struct EqualPred
{
constexpr bool operator()(const double& lhs, const double& rhs) const
{
if (std::isnan(lhs) && std::isnan(rhs)) return true;
return lhs == rhs;
}
};
constexpr int N=5000;
void test_insert(double value)
{
std::unordered_set<double, std::hash<double>, EqualPred> s;
auto begin = std::chrono::high_resolution_clock::now();
for (int i = 0; i < N; i++) {
s.insert(value);
}
auto end = std::chrono::high_resolution_clock::now();
std::cout << "Duration: " << (std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count() / 1e9) << "\n";
std::cout << "Number of elements: "<<s.size()<<"\n";
}
int main(){
std::cout << "Not NAN\n";
test_insert(1.0); //takes 0.0001 s
std::cout << "NAN\n";
test_insert(NAN); //takes 0.2 s
}
It is worth noting (thanks to @ead's comment) that -NaN
and +NaN
may hash to different values. If you want to handle these as identical, you'll need to provide a different implementation of the second template parameter, the hash function. This should detect any NaNs and hash the same NaN each time.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With