What are efficient ways to sort arrays that have mostly a small set of duplicated elements? That is, a list like:
{ 10, 10, 55, 10, 999, 8851243, 10, 55, 55, 55, 10, 999, 8851243, 10 }
Assuming that the order of equal
elements doesn't matter, what are good worst-case/average-case algorithms?
A simple solution would be to use an efficient comparison-based sorting algorithm like Merge Sort, Quicksort, or Heapsort, which solves the problem in O(n log n) time but takes no advantage of the fact that the array contains many duplicated values. A better approach is to use a counting sort.
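For reference, here is a minimal counting-sort sketch in C++ (my own illustration, not from the answer). It assumes the values are non-negative and the maximum value is small enough that a frequency array over the whole range is affordable — which is not really true for the example above, where 8851243 would force a very large array; the hash-table variants further down avoid that problem.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Counting sort for non-negative integers.
// Assumes max(values) is small enough that a frequency array of that size is acceptable.
void counting_sort(std::vector<int>& values) {
    if (values.empty()) return;
    int max_value = *std::max_element(values.begin(), values.end());
    std::vector<std::size_t> counts(static_cast<std::size_t>(max_value) + 1, 0);
    for (int v : values) ++counts[v];            // O(n): tally each value
    std::size_t out = 0;
    for (int v = 0; v <= max_value; ++v)         // O(range): emit each value counts[v] times
        for (std::size_t c = 0; c < counts[v]; ++c)
            values[out++] = v;
}
```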
As you may have observed, the time complexity of Quicksort is O(n log n) in the best and average cases and O(n^2) in the worst case. But since it has the upper hand on average for most inputs, Quicksort is generally considered the “fastest” general-purpose sorting algorithm.
Insertion sort is the clear winner when the data is nearly sorted. Bubble sort is fast there too, but insertion sort has lower overhead. Shellsort is fast because it is based on insertion sort. Merge sort, heapsort, and quicksort do not adapt to nearly sorted data.
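As a point of reference, here is a minimal insertion sort sketch in C++ (my own illustration): the inner loop does nothing but a comparison and a move, which is where the low overhead comes from, and on nearly sorted input that loop exits almost immediately.

```cpp
#include <cstddef>
#include <vector>

// Insertion sort: each element is shifted left until it sits after the last
// element not greater than it. On nearly sorted input the inner loop
// terminates almost immediately, giving close to O(n) total work.
void insertion_sort(std::vector<int>& a) {
    for (std::size_t i = 1; i < a.size(); ++i) {
        int key = a[i];
        std::size_t j = i;
        while (j > 0 && a[j - 1] > key) {
            a[j] = a[j - 1];   // shift larger elements one slot to the right
            --j;
        }
        a[j] = key;
    }
}
```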
In general, the Quicksort algorithm has an average-case time complexity of O(n log n) and a worst-case time complexity of O(n^2). With a high density of duplicate keys, we almost always get the worst-case performance with the trivial implementation of Quicksort.
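To see why, here is a deliberately trivial quicksort sketch in C++ with a Lomuto-style partition (an illustration of the failure mode, not a recommended implementation): when every key equals the pivot, all elements end up on the same side of the partition, so the recursion depth becomes n and the total work O(n^2).

```cpp
#include <utility>
#include <vector>

// Lomuto-style partition: everything <= pivot goes to the left side.
// With many duplicate keys, elements equal to the pivot all fall on the same
// side, so the split is maximally unbalanced and quicksort degrades to O(n^2).
// Three-way ("fat pivot") partitioning avoids this.
int partition(std::vector<int>& a, int low, int high) {
    int pivot = a[high];
    int i = low - 1;
    for (int j = low; j < high; ++j)
        if (a[j] <= pivot)                 // equal keys also satisfy this test
            std::swap(a[++i], a[j]);
    std::swap(a[i + 1], a[high]);
    return i + 1;
}

void quicksort(std::vector<int>& a, int low, int high) {
    if (low >= high) return;
    int p = partition(a, low, high);
    quicksort(a, low, p - 1);              // on all-equal input this range has n-1 elements
    quicksort(a, p + 1, high);             // ...and this one is empty, every time
}
```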
In practice, you can first iterate through the array once and use a hash table to count the number of occurrences of the individual elements (this is O(n), where n = size of the list). Then take all the unique elements and sort them (this is O(k log k), where k = number of unique elements), and then expand this back to a list of n elements in O(n) steps, recovering the counts from the hash table. If k << n you save time.
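A C++ sketch of that idea, assuming the elements are plain ints (count into a hash table, sort the k distinct keys, then expand):

```cpp
#include <algorithm>
#include <cstdio>
#include <unordered_map>
#include <vector>

// Sort by counting duplicates: O(n) counting pass, O(k log k) sort of the
// distinct keys, O(n) expansion pass. A clear win when k << n.
void sort_by_counting(std::vector<int>& values) {
    std::unordered_map<int, std::size_t> counts;       // value -> number of occurrences
    for (int v : values) ++counts[v];                   // O(n)

    std::vector<int> unique_keys;
    unique_keys.reserve(counts.size());
    for (const auto& kv : counts) unique_keys.push_back(kv.first);
    std::sort(unique_keys.begin(), unique_keys.end());  // O(k log k)

    std::size_t out = 0;
    for (int key : unique_keys)                          // O(n) total: emit each key count times
        for (std::size_t c = 0; c < counts[key]; ++c)
            values[out++] = key;
}

int main() {
    std::vector<int> a{10, 10, 55, 10, 999, 8851243, 10, 55, 55, 55, 10, 999, 8851243, 10};
    sort_by_counting(a);
    for (int v : a) std::printf("%d ", v);               // prints 10 x6, 55 x4, 999 x2, 8851243 x2
    std::printf("\n");
}
```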
I would try counting sort with a mapping function. That is, you would not use a frequencies array whose size equals the range of the elements; instead you would iterate over the array, record the distinct elements, and use a mapping function to index into the array of frequencies.

This way the algorithm needs only one extra iteration plus a mapping function, which should run in constant time per element (using some kind of hash table). The counting and expansion passes are O(n); the only extra cost is putting the k distinct keys in order, which is negligible when k is small, so for arrays with few unique values this should be close to optimal.
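A closely related variant (my own sketch, not from the answer) replaces the hash table with a std::map, which keeps the distinct keys in sorted order as it counts, so no separate sort of the keys is needed; the counting pass then costs O(n log k) instead of O(n).

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Count into an ordered map: the keys come back out already sorted.
// O(n log k) to count, O(n) to expand, where k = number of distinct values.
void sort_via_ordered_counts(std::vector<int>& values) {
    std::map<int, std::size_t> counts;
    for (int v : values) ++counts[v];
    std::size_t out = 0;
    for (const auto& [key, count] : counts)
        for (std::size_t c = 0; c < count; ++c)
            values[out++] = key;
}
```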