I've been interested in optimizing "renumbering" algorithms that can relabel an arbitrary array of integers with duplicates into labels starting from 1. Sets and maps are too slow for what I've been trying to do, as are sorts. Is there a data structure that only remembers if a number has been seen or not reliably? I was considering experimenting with a bloom filter, but I have >12M integers and the target performance is faster than a good hashmap. Is this possible?
Here's a simple example pseudo-c++ algorithm that would be slow:
// note: all elements guaranteed > 0
std::vector<uint64_t> arr = { 21942198, 91292, 21942198, ... millions more };
std::unordered_map<uint64_t, uint64_t> renumber;
renumber.reserve(arr.size());
uint64_t next_label = 1;
for (uint64_t i = 0; i < arr.size(); i++) {
uint64_t elem = arr[i];
if (renumber[elem]) {
arr[i] = renumber[elem];
}
else {
renumber[elem] = next_label;
arr[i] = next_label;
++next_label;
}
}
Example input/output:
{ 12, 38, 1201, 120, 12, 39, 320, 1, 1 }
->
{ 1, 2, 3, 4, 1, 5, 6, 7, 7 }
Your algorithm is not bad, but the appropriate data structure to use for the map is a hash table with open addressing.
As explained in this answer, std::unordered_map can't be implemented that way: https://stackoverflow.com/a/31113618/5483526
So if the STL container is really too slow for you, then you can do better by making your own.
Note, however, that:
uint64_t next_label = 1;
for (size_t i = 0; i < arr.size(); i++) {
uint64_t elem = arr[i];
uint64_t &label = renumber[elem];
if (!label) {
label = next_label++;
}
arr[i] = label;
}
Note that the unordered_map operator [] returns a reference to the associated value (creating it if it doesn't exist), so you can test and modify the value without having to search the map again.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With