Sort when only equality is available

Question

Suppose we have a vector of pairs:

std::vector<std::pair<A,B>> v;

where for type A only equality is defined:

bool operator==(A const & lhs, A const & rhs) { ... }

How would you sort it so that all pairs with the same first element end up adjacent? To be clear, the output I hope to achieve should be the same as what something like this produces:

std::unordered_multimap<A,B> m(v.begin(),v.end());
std::copy(m.begin(),m.end(),v.begin());

However, I would like, if possible, to:

Do the sorting in place.
Avoid the need to define a hash function for A.

Edit: additional concrete information.

In my case the number of elements isn't particularly big (I expect N = 10~1000), but I have to repeat this sorting many times (~400) as part of a bigger algorithm. The datatype A is fairly big: among other things it contains an unordered_map with ~20 std::pair<uint32_t,uint32_t> entries, which is what prevents me from inventing an ordering and makes it hard to build a hash function.

TemplateRex · Accepted Answer

First option: `cluster()` and `sort_within()`

The handwritten double loop by @MadScienceDreams can be written as a cluster() algorithm of O(N * K) complexity for N elements and K clusters. It repeatedly calls std::partition (shown in C++14 style with generic lambdas, easily adaptable to C++11, or even to C++98 style by writing your own function objects):

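#include <algorithm>   // std::partition
#include <functional>  // std::equal_to
#include <iterator>    // std::next

template<class FwdIt, class Equal = std::equal_to<>>
void cluster(FwdIt first, FwdIt last, Equal eq = Equal{})
{
    for (auto it = first; it != last; /* increment inside loop */)
        // Partition only the elements after the pivot *it, so the pivot
        // itself is never moved while the predicate refers to it.
        it = std::partition(std::next(it), last, [=](auto const& elem){
            return eq(elem, *it);
        });
}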

which you call on your input vector<std::pair> as

cluster(begin(v), end(v), [](auto const& L, auto const& R){
    return L.first == R.first;
});

The next algorithm to write is sort_within which takes two predicates: an equality and a comparison function object, and repeatedly calls std::find_if_not to find the end of the current range, followed by std::sort to sort within that range:

template<class RndIt, class Equal = std::equal_to<>, class Compare = std::less<>>
void sort_within(RndIt first, RndIt last, Equal eq = Equal{}, Compare cmp = Compare{})
{
    for (auto it = first; it != last; /* increment inside loop */) {
        auto next = std::find_if_not(it, last, [=](auto const& elem){
            return eq(elem, *it);
        });
        std::sort(it, next, cmp);
        it = next;
    }
}

On an already clustered input, you can call it as:

sort_within(begin(v), end(v), 
    [](auto const& L, auto const& R){ return L.first == R.first; },
    [](auto const& L, auto const& R){ return L.second < R.second; }
);

Live Example that shows it for some real data using std::pair<int, int>.
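
For completeness, here is a minimal self-contained sketch of the two algorithms applied to a key type that supports only operator==. The Key type, the Item alias and the helpers is_clustered and cluster_and_sort are made up for illustration. This variant of cluster() partitions only the elements after the pivot, so the pivot is never moved while the predicate refers to it.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <iterator>
#include <string>
#include <utility>
#include <vector>

// Hypothetical key type for illustration: it supports == only.
struct Key {
    std::string name;
};
bool operator==(Key const& lhs, Key const& rhs) { return lhs.name == rhs.name; }

using Item = std::pair<Key, int>;

// Group equal elements together, O(N * K) for N elements and K clusters.
template<class FwdIt, class Equal = std::equal_to<>>
void cluster(FwdIt first, FwdIt last, Equal eq = Equal{})
{
    while (first != last) {
        auto const pivot = first;
        first = std::partition(std::next(pivot), last,
            [&](auto const& elem) { return eq(elem, *pivot); });
    }
}

// Sort each run of equal elements with cmp; expects clustered input.
template<class RndIt, class Equal = std::equal_to<>, class Compare = std::less<>>
void sort_within(RndIt first, RndIt last, Equal eq = Equal{}, Compare cmp = Compare{})
{
    while (first != last) {
        auto const next = std::find_if_not(first, last,
            [&](auto const& elem) { return eq(elem, *first); });
        std::sort(first, next, cmp);
        first = next;
    }
}

// Check helper: true if no key reappears after its run has ended.
bool is_clustered(std::vector<Item> const& v)
{
    for (std::size_t i = 1; i < v.size(); ++i) {
        if (v[i].first == v[i - 1].first) continue;      // still inside a run
        for (std::size_t j = 0; j < i; ++j)
            if (v[j].first == v[i].first) return false;  // key seen earlier
    }
    return true;
}

// Cluster by key, then order each cluster by the second member.
void cluster_and_sort(std::vector<Item>& v)
{
    auto const same_key = [](Item const& l, Item const& r) { return l.first == r.first; };
    auto const by_value = [](Item const& l, Item const& r) { return l.second < r.second; };
    cluster(v.begin(), v.end(), same_key);
    sort_within(v.begin(), v.end(), same_key, by_value);
}
```

The order in which the clusters appear depends on what std::partition does with the non-matching elements, so only the grouping and the order inside each group should be relied upon.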

Second option: user-defined comparison

Even if there is no operator< defined on A, you might define it yourself. Here, there are two broad options. First, if A is hashable, you can define

bool operator<(A const& L, A const& R)
{
    return std::hash<A>()(L) < std::hash<A>()(R);
}

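and write std::sort(begin(v), end(v)) directly. You will have O(N log N) calls to std::hash unless you cache the hash values in separate storage. Note that this only groups equal keys reliably if distinct keys never share a hash value: two unequal keys whose hashes collide compare as equivalent and may end up interleaved.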

Second, if A is not hashable but has data member getters x(), y() and z() that uniquely determine equality on A, you can do

bool operator<(A const& L, A const& R)
{
    return std::tie(L.x(), L.y(), L.z()) < std::tie(R.x(), R.y(), R.z());
}

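Again you can write std::sort(begin(v), end(v)) directly. Since std::tie binds lvalue references, this requires the getters to return references; with getters that return by value, use std::make_tuple instead.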
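
As a rough sketch of this second variant, assume a hypothetical A whose equality is determined by three plain data members x, y and z (stand-ins for the getters above), and a helper sort_by_key invented for this example. Sorting with a comparator on the key only has the side benefit that B does not need an operator< of its own, which the default comparison of std::pair would require.

```cpp
#include <algorithm>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical key type whose equality is fully determined by x, y and z.
struct A {
    int x, y, z;
};
bool operator==(A const& l, A const& r)
{
    return std::tie(l.x, l.y, l.z) == std::tie(r.x, r.y, r.z);
}

// Lexicographic order on the same members that define equality.
// std::tie needs lvalues; for getters that return by value,
// std::make_tuple would be the drop-in replacement.
bool operator<(A const& l, A const& r)
{
    return std::tie(l.x, l.y, l.z) < std::tie(r.x, r.y, r.z);
}

// Sort on the key only, so B does not need operator< of its own.
// stable_sort keeps the input order among pairs with equal keys.
template<class B>
void sort_by_key(std::vector<std::pair<A, B>>& v)
{
    std::stable_sort(v.begin(), v.end(),
        [](auto const& l, auto const& r) { return l.first < r.first; });
}
```

Because the ordering uses exactly the members that define equality, two keys are equivalent under operator< if and only if they are equal, so equal keys always end up adjacent.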

MrPisarik · Answer

If you can come up with a function that assigns a unique number to each unique element, you can build a secondary array of these numbers, then sort the secondary array and the primary array along with it, for example by merge sort.

This does require a function that assigns a unique number to each unique element, i.e. a hash function without collisions. I think this should not be a problem.

As for the asymptotics: if the hash function is O(1), building the secondary array is O(N) and sorting it together with the primary array is O(N log N), for a total of O(N + N log N) = O(N log N). The downside of this solution is that it requires double the memory.

In short, the point of this solution is to translate your elements quickly into values that can be compared quickly.
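
One way to get such numbers without writing a hash function is to number the keys in order of first appearance, using only operator==. The sketch below does that; note that it swaps the collision-free hash for a linear search over the keys seen so far, so labelling costs O(N * K) comparisons for K distinct keys rather than O(N). The names label_keys and sort_by_labels are invented for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Assign each element the number of the first distinct key equal to it.
// Uses only operator== on the key: O(N * K) comparisons for K distinct keys.
template<class A, class B>
std::vector<std::size_t> label_keys(std::vector<std::pair<A, B>> const& v)
{
    std::vector<std::size_t> reps;               // positions of first occurrences
    std::vector<std::size_t> ids(v.size());
    for (std::size_t i = 0; i < v.size(); ++i) {
        std::size_t k = 0;
        while (k < reps.size() && !(v[reps[k]].first == v[i].first)) ++k;
        if (k == reps.size()) reps.push_back(i);  // new key gets the next number
        ids[i] = k;
    }
    return ids;
}

// Reorder v so that equal keys are adjacent, in order of first appearance.
template<class A, class B>
void sort_by_labels(std::vector<std::pair<A, B>>& v)
{
    auto const ids = label_keys(v);
    std::vector<std::size_t> order(v.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    // Sort positions by their integer label; stable keeps input order within a key.
    std::stable_sort(order.begin(), order.end(),
        [&](std::size_t l, std::size_t r) { return ids[l] < ids[r]; });

    std::vector<std::pair<A, B>> sorted;         // the extra memory the answer mentions
    sorted.reserve(v.size());
    for (std::size_t i : order) sorted.push_back(std::move(v[i]));
    v = std::move(sorted);
}
```

The expensive key comparisons happen only during labelling; the sort itself compares plain integers.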

IdeaHat · Answer

An in-place algorithm is

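int n = static_cast<int>(v.size());
for (int i = 0; i < n-2; i++)
{
   // Caveat: if v[i+1] already equals v[i], a later duplicate can be left
   // behind; the edited version below skips such runs first.
   for (int j = i+2; j < n; j++)
   {
      if (v[j].first == v[i].first)
      {
         // move the match next to the current end of the run
         std::swap(v[j],v[i+1]);
         i++;
      }
   }
}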

There is probably a more elegant way to write the loop, but this is O(n*m), where n is the number of elements and m is the number of distinct keys. So if m is much smaller than n (the best case being that all the keys are the same), this is approximately O(n). In the worst case the number of keys is about n, so this is O(n^2). I don't know how many keys you expect, so I can't really work out the average case, but it is most likely O(n^2) as well.

For a small number of keys, this may work faster than unordered multimap, but you'll have to measure to find out.

Note: the order of the clusters is arbitrary.

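Edit: skip over elements that are already adjacent to an equal key (much more efficient in the partially clustered case; doesn't change the complexity). This also avoids leaving a duplicate behind when v[i+1] already equals v[i], which the first version can do.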

int n = static_cast<int>(v.size());
for (int i = 0; i < n-2; i++)
{
   // skip elements that already sit next to an equal key
   for(;i<n-2 && v[i+1].first==v[i].first; i++){}

   for (int j = i+2; j < n; j++)
   {
      if (v[j].first == v[i].first)
      {
         std::swap(v[j],v[i+1]);
         i++;
      }
   }
}

Edit 2: At /u/MrPisarik's comment, removed redundant i check in inner loop.
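
The edited loop can be wrapped into a function along the following lines. This is only a sketch: the name cluster_in_place is made up, and the bounds are written as i + 2 < n so that an unsigned size does not underflow for vectors with fewer than two elements.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// In-place clustering that relies only on operator== for the key type.
template<class A, class B>
void cluster_in_place(std::vector<std::pair<A, B>>& v)
{
    std::size_t const n = v.size();
    for (std::size_t i = 0; i + 2 < n; ++i) {
        // Skip elements that already sit next to an equal key.
        while (i + 2 < n && v[i + 1].first == v[i].first) ++i;
        // Pull every later match next to the current end of the run.
        for (std::size_t j = i + 2; j < n; ++j) {
            if (v[j].first == v[i].first) {
                std::swap(v[j], v[i + 1]);
                ++i;
            }
        }
    }
}
```

For example, keys in the order 1, 1, 2, 1 come out as 1, 1, 1, 2, which is the case the skip step is needed for.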

Tags: c++, algorithm, sorting, partition

pqnet
