Most efficient sorting algorithm for many identical keys?

What is the most efficient algorithm for grouping identical items together in an array, given the following:

  1. Almost all items are duplicated several times.
  2. The items are not necessarily integers or anything else that's similarly simple. The range of the keys is not even well-defined, let alone small. In fact, the keys can be arbitrary structs. This rules out the simplest forms of counting sort.
  3. We care about both asymptotic and non-asymptotic properties, and n may sometimes be small. However, when n is small, performance still matters because this function may be called several million times in a loop on millions of small datasets. This rules out expensive hash functions and complex data structures that need to perform lots of memory allocations.
  4. The data may be sorted in arbitrary order as long as all identical items are grouped together.

If this is confusing, here's an example, assuming such a function is named groupIdentical:

uint[] foo = [1,2,3,2,1,5,4,5];
uint[] bar = groupIdentical(foo);
// One possible correct value for bar:
// bar == [2,2,1,1,3,4,5,5].
// Another possible correct answer:
// bar == [1,1,2,2,5,5,4,3].

However, as a reminder, we cannot assume that the data is composed of integers.

Edit: Thank you for the answers. My main problem with hashing was that hash tables perform memory allocations too frequently. What I ended up doing was writing my own hash table that uses a region allocator I already had, which gets around this problem. It works well.
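For reference, here's a minimal sketch of this kind of hash-based grouping, with D's built-in associative array standing in for the custom region-allocated table described above (the allocator itself isn't shown; only the name groupIdentical is taken from the example, the rest is made up):

import std.stdio : writeln;

// Sketch only: a built-in associative array stands in for the custom
// region-allocated hash table mentioned in the edit above.
T[] groupIdentical(T)(T[] items)
{
    size_t[T] counts;                      // distinct key -> number of occurrences
    foreach (item; items)
    {
        if (auto p = item in counts) ++*p; // key already seen
        else counts[item] = 1;             // first occurrence
    }

    // Identical items are interchangeable, so emit `count` copies of each key.
    auto result = new T[](items.length);
    size_t pos = 0;
    foreach (key, count; counts)           // key order is arbitrary; grouping is all we need
        foreach (_; 0 .. count)
            result[pos++] = key;
    return result;
}

void main()
{
    uint[] foo = [1, 2, 3, 2, 1, 5, 4, 5];
    writeln(groupIdentical(foo));          // e.g. [3, 4, 5, 5, 1, 1, 2, 2]
}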

asked Dec 09 '08 by dsimcha


4 Answers

I think you could just hash the objects, since the actual order doesn't matter, only the grouping. Identical objects will end up grouped in the same bucket. This assumes that every type you're interested in has its own hash function, or that you can define your own and overload it (taking each type as a parameter to a different hashCode function definition).

To avoid collisions across data types (so strings don't end up in the same bucket as doubles, for one example), you'd need to encode the data type into the hash. So, for example, if you have a 32-bit hash, maybe the first 5 bits could encode the data type, so you can have 32 different types in the same hash map.

EDIT: Let me just add that the reason that I'm suggesting a custom hash map is because I don't know of one that exposes enough of its internal implementation for you to get the values out of each bucket. There might be such an implementation that I don't know of. There are a lot of things I don't know. :)
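A rough sketch of that bit-packing might look something like this; the 5-bit split and the use of typeid(T).getHash are purely illustrative choices, not a fixed scheme:

import std.stdio : writeln;

// Sketch only: pack a per-type tag into the top 5 bits of a 32-bit hash so
// values of different types never share a bucket. taggedHash is a made-up name.
uint taggedHash(T)(T value, uint typeTag)
{
    assert(typeTag < 32, "5 tag bits allow at most 32 distinct types");
    uint valueHash = cast(uint) typeid(T).getHash(&value);
    return (typeTag << 27) | (valueHash >>> 5);  // top 5 bits: type, low 27: value hash
}

void main()
{
    writeln(taggedHash(3.14, 1));    // a double, tagged as type 1
    writeln(taggedHash("abc", 2));   // a string, tagged as type 2
}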

answered by Bill the Lizard


The magic word you're looking for here is multiset (or bag). It's not really a sort at all, since you don't care about the order as long as all the elements with equal keys end up grouped together. There are several canned implementations available, depending on the language you're using, but in general the hashed approach above is asymptotically optimal, I believe: insert() is constant time, since you can compute a hash in O(1) and append colliding inserts to a list in O(1); you can retrieve one element from a bin in O(1), since you just grab the first one in the bin; and you can therefore collect all of them in O(n), since you retrieve n elements at O(1) each.
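As a rough illustration, a hashed multiset in D might be sketched like this, assuming the key type has usable hashing and equality (the names are made up):

import std.stdio : writeln;

// Sketch only: a hashed multiset built from D's associative array, mapping
// each distinct key to a bucket holding every item equal to it.
T[] groupWithBuckets(T)(T[] items)
{
    T[][T] buckets;                        // key -> all items equal to that key
    foreach (item; items)
    {
        if (auto bucket = item in buckets) // expected O(1) lookup
            *bucket ~= item;               // append to the existing bin
        else
            buckets[item] = [item];        // first item opens the bin
    }

    T[] result;
    result.reserve(items.length);
    foreach (key, bucket; buckets)         // bin order is arbitrary, which is allowed
        result ~= bucket;
    return result;
}

void main()
{
    writeln(groupWithBuckets([1u, 2, 3, 2, 1, 5, 4, 5]));
}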

answered by Charlie Martin


A galloping mergesort, such as Python's built-in sort (cf. timsort), has good expected performance when there are large runs of already-sorted data (like, in your example, identical objects) -- you skip O(log N) work per merge. You can also distribute a mergesort across multiple CPUs and disks if your dataset is extremely large (this is called an "external" sort). However, it is still O(N log N) in the worst case.
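If your items do define an ordering, this is easy to try; in D, for example, Phobos's stable sort is (as far as I know) a timsort variant, so runs of equal or already-ordered items merge cheaply. A quick sketch:

import std.algorithm.sorting : sort, SwapStrategy;
import std.stdio : writeln;

void main()
{
    // Sketch only: a stable (run-aware) sort groups duplicates by fully
    // ordering them; it needs opCmp/"<" on the items, unlike the hash approaches.
    auto data = [1u, 1, 1, 2, 2, 3, 3, 3, 2, 1, 5, 4, 5];
    data.sort!("a < b", SwapStrategy.stable);
    writeln(data);   // [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 5, 5]
}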

The only sorts that are faster than O(N log N) are counting-style sorts, which exploit some common property of the keys. To use a linear-time sort (a hash table or a radix/bucket sort), you'll have to hash the structs to generate some kind of numerical key.

Radix sort will make multiple passes through the keys, so its expected time is longer than that of a hash table approach; and since you don't care about lexicographic order, the hash table solution sounds better for you, if you can afford to hash the keys.

answered by user26294


3-way QuickSort performs very well when there is a large number of duplicates.
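A quick sketch of the idea, assuming the items define an ordering (the pivot choice and names here are arbitrary): each partitioning pass splits the slice into less-than, equal-to, and greater-than sections, so every run of duplicates is placed in a single pass and never revisited.

import std.stdio : writeln;

// Sketch only: 3-way ("Dutch national flag") quicksort.
void quickSort3Way(T)(T[] a)
{
    if (a.length < 2) return;

    T pivot = a[a.length / 2];          // arbitrary pivot choice for the sketch
    size_t lo = 0, i = 0, hi = a.length;

    while (i < hi)
    {
        if (a[i] < pivot)               // goes to the front section
        {
            auto tmp = a[lo]; a[lo] = a[i]; a[i] = tmp;
            ++lo; ++i;
        }
        else if (a[i] > pivot)          // goes to the back section
        {
            --hi;
            auto tmp = a[i]; a[i] = a[hi]; a[hi] = tmp;
        }
        else ++i;                       // equal to pivot: stays in the middle
    }

    quickSort3Way(a[0 .. lo]);          // recurse on the strictly-smaller part
    quickSort3Way(a[hi .. $]);          // recurse on the strictly-larger part
}

void main()
{
    auto foo = [1u, 2, 3, 2, 1, 5, 4, 5];
    quickSort3Way(foo);
    writeln(foo);                       // [1, 1, 2, 2, 3, 4, 5, 5]
}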

answered by Christian C. Salvadó