Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Inserting multiple not-a-numbers into a std::unordered_set<double>

One of consequences of the IEEE 754 standard is the non-intuitive behavior of std::unordered_set<double>, when not-a-number elements (NANs) are inserted.

Due to the fact that NAN!=NAN, after the following sequence:

#include <iostream>
#include <cmath>
#include <unordered_set>

int main(){
    std::unordered_set<double> set;
    set.insert(NAN);
    set.insert(NAN);
    std::cout<<"Number of elements "<<set.size()<<"\n";  //there are 2 elements!
}

there are two elements in the set(see it live): NAN and NAN!

Mine main issue with this is, that when N NANs are inserted into the hash-set, they all hit the same hash-bucket and the performance of N inserts into the hash-set degenerates to the worst-case running time - O(N^2).

For an example, see the listing at the end of the question or here live - inserting NAN takes some order of magnitude more time than a "normal" floating number.

My question: is it possible (and if yes - how) to tweak std::unordered_set<double> in such a way, that there is at most one NAN-element in the set, no matter the flavor of inserted NANs (NAN, -NAN and so on)?


Listing:

#include <iostream>
#include <cmath>
#include <unordered_set>
#include <chrono>

constexpr int N=5000;
void test_insert(double value)
{
    std::unordered_set<double> s;
    auto begin = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < N; i++) {
        s.insert(value);
    }
    auto end = std::chrono::high_resolution_clock::now();
    std::cout << "Duration: " << (std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count() / 1e9) << "\n";
    std::cout << "Number of elements: "<<s.size()<<"\n";
}

int main(){
    std::cout << "Not NAN\n";
    test_insert(1.0);           //takes 0.0001 s
    std::cout << "NAN\n";
    test_insert(NAN);           //takes 0.2 s
}
like image 833
ead Avatar asked Jun 26 '18 12:06

ead


1 Answers

You can define your own predicate to compare the keys, and ensure NaNs compare equal for this purpose. This can be supplied as the third parameter to the std::unordered_set class template.

See the definition of EqualPred below (code copied from question and modified), and its use in declaring the unordered_set variable. I took the default for the second parameter from the documentation at https://en.cppreference.com/w/cpp/container/unordered_set

Live demo: http://coliru.stacked-crooked.com/a/7085936431e6698f

#include <iostream>
#include <cmath>
#include <unordered_set>
#include <chrono>

struct EqualPred
{
    constexpr bool operator()(const double& lhs, const double& rhs) const
    {
        if (std::isnan(lhs) && std::isnan(rhs)) return true;
        return lhs == rhs;
    }
};

constexpr int N=5000;
void test_insert(double value)
{
    std::unordered_set<double, std::hash<double>, EqualPred> s;
    auto begin = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < N; i++) {
        s.insert(value);
    }
    auto end = std::chrono::high_resolution_clock::now();
    std::cout << "Duration: " << (std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count() / 1e9) << "\n";
    std::cout << "Number of elements: "<<s.size()<<"\n";
}

int main(){
    std::cout << "Not NAN\n";
    test_insert(1.0);           //takes 0.0001 s
    std::cout << "NAN\n";
    test_insert(NAN);           //takes 0.2 s
}

It is worth noting (thanks to @ead's comment) that -NaN and +NaN may hash to different values. If you want to handle these as identical, you'll need to provide a different implementation of the second template parameter, the hash function. This should detect any NaNs and hash the same NaN each time.

like image 123
Andrew Avatar answered Sep 18 '22 14:09

Andrew