Suppose we have a vector of pairs:
std::vector<std::pair<A,B>> v;
where for type A
only equality is defined:
bool operator==(A const & lhs, A const & rhs) { ... }
How would you sort it that all pairs with the same first
element will end up close? To be clear, the output I hope to achieve should be the same as does something like this:
std::unordered_multimap<A,B> m(v.begin(),v.end());
std::copy(m.begin(),m.end(),v.begin());
However I would like, if possible, to:
Edit: additional concrete information.
In my case the number of elements isn't particularly big (I expect N = 10~1000), though I have to repeat this sorting many times ( ~400) as part of a bigger algorithm, and the datatype known as A
is pretty big (it contains among other things an unordered_map
with ~20 std::pair<uint32_t,uint32_t>
in it, which is the structure preventing me to invent an ordering, and making it hard to build a hash function)
cluster()
and sort_within()
The handwritten double loop by @MadScienceDreams can be written as a cluster()
algorithm of O(N * K)
complexity with N
elements and K
clusters. It repeatedly calls std::partition
(using C++14 style with generic lambdas, easily adaptable to C++1, or even C++98 style by writing your own function objects):
template<class FwdIt, class Equal = std::equal_to<>>
void cluster(FwdIt first, FwdIt last, Equal eq = Equal{})
{
for (auto it = first; it != last; /* increment inside loop */)
it = std::partition(it, last, [=](auto const& elem){
return eq(elem, *it);
});
}
which you call on your input vector<std::pair>
as
cluster(begin(v), end(v), [](auto const& L, auto const& R){
return L.first == R.first;
});
The next algorithm to write is sort_within
which takes two predicates: an equality and a comparison function object, and repeatedly calls std::find_if_not
to find the end of the current range, followed by std::sort
to sort within that range:
template<class RndIt, class Equal = std::equal_to<>, class Compare = std::less<>>
void sort_within(RndIt first, RndIt last, Equal eq = Equal{}, Compare cmp = Compare{})
{
for (auto it = first; it != last; /* increment inside loop */) {
auto next = std::find_if_not(it, last, [=](auto const& elem){
return eq(elem, *it);
});
std::sort(it, next, cmp);
it = next;
}
}
On an already clustered input, you can call it as:
sort_within(begin(v), end(v),
[](auto const& L, auto const& R){ return L.first == R.first; },
[](auto const& L, auto const& R){ return L.second < R.second; }
);
Live Example that shows it for some real data using std::pair<int, int>
.
Even if there is no operator<
defined on A
, you might define it yourself. Here, there are two broad options. First, if A
is hashable, you can define
bool operator<(A const& L, A const& R)
{
return std::hash<A>()(L) < std::hash<A>()(R);
}
and write std::sort(begin(v), end(v))
directly. You will have O(N log N)
calls to std::hash
if you don't want to cache all the unique hash values in a separate storage.
Second, if A
is not hashable, but does have data member getters x()
, y()
and z()
, that uniquely determine equality on A
: you can do
bool operator<(A const& L, A const& R)
{
return std::tie(L.x(), L.y(), L.z()) < std::tie(R.x(), R.y(), R.z());
}
Again you can write std::sort(begin(v), end(v))
directly.
if you can come up with a function that assigns to each unique element a unique number, then you can build secondary array with this unique numbers and then sort secondary array and with it primary for example by merge sort.
But in this case you need function that assigns to each unique element a unique number i.e. hash-function without collisions. I think this should not be a problem.
And asymptotic of this solution if hash-function have O(1), then building secondary array is O(N) and sorting it with primary is O(NlogN). And summary O(N + NlogN) = O(N logN). And the bad side of this solution is that it requires double memory.
In conclusion the main sense of this solution is quickly translate your elements to elements which you can quickly compare.
An in place algorithm is
for (int i = 0; i < n-2; i++)
{
for (int j = i+2; j < n; j++)
{
if (v[j].first == v[i].first)
{
std::swap(v[j],v[i+1]);
i++;
}
}
There is probably a more elegant way to write the loop, but this is O(n*m), where n is the number of elements and m is the number of keys. So if m is much smaller than n (with a best case being that all the keys are the same), this can be approximated by O(n). Worst case, the number of key ~= n, so this is O(n^2). I have no idea what you expect for the number of keys, so I can't really do the average case, but it is most likely O(n^2) for the average case as well.
For a small number of keys, this may work faster than unordered multimap, but you'll have to measure to find out.
Note: the order of clusters is completely random.
Edit: (much more efficient in the partially-clustered case, doesn't change complexity)
for (int i = 0; i < n-2; i++)
{
for(;i<n-2 && v[i+1].first==v[i].first; i++){}
for (int j = i+2; j < n; j++)
{
if (v[j].first == v[i].first)
{
std::swap(v[j],v[i+1]);
i++;
}
}
Edit 2: At /u/MrPisarik's comment, removed redundant i check in inner loop.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With