 

What is the most efficient way of copying elements that occur only once in a std::vector?

I have a std::vector with elements like this:

[0 , 1 , 2 , 0 , 2 , 1 , 0 , 0 , 188 , 220 , 0 , 1 , 2 ]

What is the most efficient way to find and copy the elements that occur only once in this vector, excluding the brute-force O(n^2) algorithm? In this case the new list should contain [188, 220].

asked Feb 05 '16 by sp497

4 Answers

  1. Make an unordered_map<DataType, Count> count;
  2. Iterate over the input vector, increasing the count of each value: count[value]++;
  3. Iterate over the count map copying keys for which value is 1.

It's O(n) on average. Hashing has overhead, so for small data sets a normal std::map might be more efficient in practice, though technically it would be O(n log n).

It's a good method for discrete data sets.

Code example:

#include <iostream>
#include <unordered_map>
#include <vector>
#include <algorithm>
using namespace std;

int main() {
    vector<int> v{1,1,2,3,3,4};
    unordered_map<int,int> count;
    for (const auto& e : v) count[e]++;
    vector<int> once;
    for (const auto& e : count) if(e.second == 1) once.push_back(e.first);
    for (const auto& e : once) cout << e << '\n';
    return 0;
}

I have tried a few ideas, but I don't see a way around a map. unordered_multiset is almost a great fit... except it does not let you iterate over distinct keys. It has a method to check the count of a key, but you would need another set just for the keys to probe, and I don't see that as simpler. In modern C++, with auto, counting is easy. I've also looked through the algorithm library, but I haven't found any transform, copy_if, generate, etc. that could conditionally transform an element (map entry -> value if count is 1).

answered Oct 19 '22 by luk32


There are very few universally-optimal algorithms. Which algorithm works best usually depends upon the properties of the data that's being processed. Removing duplicates is one such example.

Is v small and filled mostly with unique values?

auto lo = v.begin(), hi = v.end();
std::sort(lo, hi);
while (lo != v.end()) {
    hi = std::mismatch(lo + 1, v.end(), lo).first;
    lo = (std::distance(lo, hi) == 1) ? hi : v.erase(lo, hi);
}

Is v small and filled mostly with duplicates?

auto lo = v.begin(), hi = v.end();
std::sort(lo, hi);
while (lo != v.end()) {
    hi = std::upper_bound(lo + 1, v.end(), *lo);
    lo = (std::distance(lo, hi) == 1) ? hi : v.erase(lo, hi);
}

Is v gigantic?

std::unordered_map<int, bool> keyUniqueness{};
keyUniqueness.reserve(v.size());
for (int key : v) {
    bool wasMissing = keyUniqueness.find(key) == keyUniqueness.end();
    keyUniqueness[key] = wasMissing;
}
v.clear();
for (const auto& element : keyUniqueness) {
    if (element.second) { v.push_back(element.first); }
}

And so on.

answered Oct 19 '22 by Brian Rodriguez


@luk32's answer is definitely the most time-efficient way of solving this question. However, if you are short on memory and can't afford an unordered_map, there are other ways of doing it.

You can use std::sort() to sort the vector first. Then the non-duplicates can be found in one pass. Overall complexity is O(n log n).
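A minimal sketch of that sort-then-scan idea (unique_once is a hypothetical name, not from the original answer):

```cpp
#include <algorithm>
#include <vector>

// After sorting, equal values are adjacent, so one pass over the runs
// finds every element that occurs exactly once.
std::vector<int> unique_once(std::vector<int> v) {  // takes a copy; caller's data untouched
    std::sort(v.begin(), v.end());
    std::vector<int> out;
    for (auto it = v.begin(); it != v.end(); ) {
        const int value = *it;
        // find the end of the run of elements equal to value
        auto run_end = std::find_if(it, v.end(),
                                    [value](int x) { return x != value; });
        if (run_end - it == 1) out.push_back(value);  // run of length 1 => unique
        it = run_end;
    }
    return out;
}
```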

If the question is slightly different, and you know there is only one non-duplicate element, it can be done in O(n).
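The answer's linked Java code isn't reproduced here; one common O(n), constant-space way of doing that, assuming every other element appears exactly twice, is the XOR trick (my sketch, not the original answer's code):

```cpp
#include <numeric>
#include <vector>

// XOR trick: pairs cancel out because a ^ a == 0 and x ^ 0 == x,
// so folding XOR over the whole vector leaves only the lone element.
int single_number(const std::vector<int>& v) {
    return std::accumulate(v.begin(), v.end(), 0,
                           [](int acc, int x) { return acc ^ x; });
}
```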

answered Oct 20 '22 by SegFault


Since you use a std::vector, I presume you want to keep all its benefits, including locality of reference. That takes a bit of typing. I benchmarked the code below...

The copying pass itself is linear O(n) (O(n log n) overall, because of the required sort). It's a bit like Brian's answer, but I write to an OutputIterator instead of working in place. The precondition is that the input range is sorted.


template<typename InputIterator, typename OutputIterator>
OutputIterator single_unique_copy(InputIterator first, InputIterator last, OutputIterator result){
    while(first != last){
        auto previous = first;
        ++first;
        if(first == last || *first != *previous){
            *(result++) = *previous;    // *previous occurs exactly once: copy it
        }else{
            // otherwise skip the whole run of equal values
            while(first != last && *first == *previous) ++first;
        }
    }
    return result;
}

And here is a sample usage:

#include <algorithm>
#include <iostream>
#include <iterator>
#include <vector>

int main(){
    std::vector<int> vm = {0, 1, 2, 0, 2, 1, 0, 0, 1, 88, 220, 0, 1, 2, 227, -8};
    std::vector<int> kk;
    std::sort(vm.begin(), vm.end());
    single_unique_copy(vm.begin(), vm.end(), std::back_inserter(kk));
    for(auto x : kk) std::cout << x << ' ';
    return 0;
}

As expected, the output is:

-8 88 220 227

Your use case may be different from mine, so, profile first... :-)

EDIT:

  • Using luk32's algorithm and mine, with 13 million elements created in descending order, repeated at every i % 5:
  • Under a debug build, luk32: 9.34 seconds, mine: 7.80 seconds
  • Under -O3, luk32: 2.71 seconds, mine: 0.52 seconds
  • MinGW 5.1 64-bit, Windows 10, 1.73 GHz Core i5 4210U, 6 GB DDR3 1600 MHz RAM
  • Benchmark here: http://coliru.stacked-crooked.com/a/187e5e3841439742

For smaller inputs the difference still holds, until the code stops being performance-critical.

answered Oct 20 '22 by WhiZTiM