Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fastest way to determine if a uint64 has been "seen" already

I've been interested in optimizing "renumbering" algorithms that can relabel an arbitrary array of integers with duplicates into labels starting from 1. Sets and maps are too slow for what I've been trying to do, as are sorts. Is there a data structure that only remembers if a number has been seen or not reliably? I was considering experimenting with a bloom filter, but I have >12M integers and the target performance is faster than a good hashmap. Is this possible?

Here's a simple example pseudo-c++ algorithm that would be slow:

// note: all elements guaranteed > 0
std::vector<uint64_t> arr = { 21942198, 91292, 21942198, ... millions more };

std::unordered_map<uint64_t, uint64_t> renumber;
renumber.reserve(arr.size());

uint64_t next_label = 1;
for (uint64_t i = 0; i < arr.size(); i++) {
    uint64_t elem = arr[i];
    if (renumber[elem]) {
        arr[i] = renumber[elem];
    }
    else {
        renumber[elem] = next_label;
        arr[i] = next_label;
        ++next_label;
    }
}

Example input/output:

{ 12, 38, 1201, 120, 12, 39, 320, 1, 1  }
->
{ 1, 2, 3, 4, 1, 5, 6, 7, 7 }
like image 660
SapphireSun Avatar asked Oct 23 '25 22:10

SapphireSun


1 Answers

Your algorithm is not bad, but the appropriate data structure to use for the map is a hash table with open addressing.

As explained in this answer, std::unordered_map can't be implemented that way: https://stackoverflow.com/a/31113618/5483526

So if the STL container is really too slow for you, then you can do better by making your own.

Note, however, that:

  1. 90% of the time, when someone complains about STL containers being too slow, they are running a debug build with optimizations turned off. Make sure you are running a release build compiled with optimizations on. Running your code on 12M integers should take a few milliseconds at most.
  2. You are accessing the map multiple times when only once is required, like this:
uint64_t next_label = 1;
for (size_t i = 0; i < arr.size(); i++) {
    uint64_t elem = arr[i];
    uint64_t &label = renumber[elem];
    if (!label) {
       label = next_label++;
    }
    arr[i] = label;
}

Note that the unordered_map operator [] returns a reference to the associated value (creating it if it doesn't exist), so you can test and modify the value without having to search the map again.

like image 163
Matt Timmermans Avatar answered Oct 26 '25 12:10

Matt Timmermans