Fast sort algorithms for arrays with mostly duplicated elements?

What are efficient ways to sort arrays whose elements mostly come from a small set of repeated values? That is, a list like:

{ 10, 10, 55, 10, 999, 8851243, 10, 55, 55, 55, 10, 999, 8851243, 10 }

Assuming that the order of equal elements doesn't matter, what are good worst-case/average-case algorithms?

asked Nov 18 '11 by donnyton

People also ask

How will you sort an array with many duplicated values?

A simple solution would be to use an efficient sorting algorithm like Merge Sort, Quicksort, or Heapsort, which can solve this problem in O(n log n) time, but those will not take advantage of the fact that there are many duplicated values in the array. A better approach is to use a counting sort.
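
For illustration, here is a minimal counting-sort sketch in C++. It assumes the keys are non-negative integers in a small, known range; the function and parameter names are only illustrative:

    #include <vector>

    // Counting sort: assumes every value lies in [0, maxValue] and maxValue is small.
    void countingSort(std::vector<int>& a, int maxValue) {
        std::vector<int> count(maxValue + 1, 0);
        for (int x : a) ++count[x];              // tally each key
        int i = 0;
        for (int v = 0; v <= maxValue; ++v)      // rewrite the array in sorted order
            while (count[v]-- > 0) a[i++] = v;
    }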

What is the fastest array sorting algorithm?

The time complexity of Quicksort is O(n log n) in the best and average cases and O(n^2) in the worst case. But since it has the upper hand in the average case for most inputs, Quicksort is generally considered the "fastest" sorting algorithm.

Which sort algorithm works best on mostly sorted data?

Insertion sort is the clear winner on this initial condition. Bubble sort is fast, but insertion sort has lower overhead. Shell sort is fast because it is based on insertion sort. Merge sort, heap sort, and quick sort do not adapt to nearly sorted data.
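
For reference, a minimal insertion-sort sketch in C++ (purely illustrative); the comment notes why it adapts so well to nearly sorted input:

    #include <cstddef>
    #include <vector>

    // Insertion sort: on nearly sorted input the inner while loop shifts very
    // few elements, so the total work approaches O(n).
    void insertionSort(std::vector<int>& a) {
        for (std::size_t i = 1; i < a.size(); ++i) {
            int key = a[i];
            std::size_t j = i;
            while (j > 0 && a[j - 1] > key) {    // shift larger elements right
                a[j] = a[j - 1];
                --j;
            }
            a[j] = key;
        }
    }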

Can Quicksort handle duplicates?

In general, the Quicksort algorithm has an average-case time complexity of O(n log n) and a worst-case time complexity of O(n^2). With a high density of duplicate keys, we almost always get the worst-case performance with the trivial implementation of Quicksort.
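
To make "the trivial implementation" concrete, here is a textbook Lomuto-partition Quicksort sketch in C++ (illustrative, not any particular library's code). With all-equal keys every element satisfies a[j] <= pivot, so each partition splits into pieces of size n-1 and 0 and the total work degrades to O(n^2):

    #include <utility>
    #include <vector>

    // Lomuto partition: everything <= pivot goes to the left side. With
    // all-equal keys the split is (n-1, 0) at every level, hence O(n^2).
    int partition(std::vector<int>& a, int lo, int hi) {
        int pivot = a[hi];
        int i = lo;
        for (int j = lo; j < hi; ++j)
            if (a[j] <= pivot) std::swap(a[i++], a[j]);
        std::swap(a[i], a[hi]);
        return i;
    }

    void quicksort(std::vector<int>& a, int lo, int hi) {
        if (lo >= hi) return;
        int p = partition(a, lo, hi);
        quicksort(a, lo, p - 1);      // with duplicates this side holds n-1 elements
        quicksort(a, p + 1, hi);
    }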


2 Answers

In practice, you can first iterate through the array once and use a hash table to count the number of occurrences of the individual elements (this is O(n), where n = size of the list). Then take all the unique elements and sort them (this is O(k log k), where k = number of unique elements), and finally expand this back into a list of n elements in O(n) steps, recovering the counts from the hash table. If k << n, you save time.
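
A sketch of that idea in C++ (the function name is mine, for illustration only): count with a hash table, sort only the k distinct keys, then expand:

    #include <algorithm>
    #include <unordered_map>
    #include <vector>

    // O(n) counting pass + O(k log k) sort of the distinct keys + O(n) expansion.
    void sortWithFewDistinct(std::vector<int>& a) {
        std::unordered_map<int, int> count;
        for (int x : a) ++count[x];                  // tally occurrences

        std::vector<int> keys;
        keys.reserve(count.size());
        for (const auto& kv : count) keys.push_back(kv.first);
        std::sort(keys.begin(), keys.end());         // only k distinct keys

        int i = 0;
        for (int k : keys)                           // expand back to n elements
            for (int c = count[k]; c > 0; --c) a[i++] = k;
    }

On the example array from the question this counts 14 elements but sorts only 4 distinct keys (10, 55, 999, 8851243).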

answered Oct 18 '22 by Antti Huima


I would try counting sort with a mapping function. That is, you wouldn't use a frequency array whose size equals the range of the elements; instead, you would iterate over the array, record the distinct elements, and use them in a mapping function into the array of frequencies.

This way the algorithm needs only one extra pass plus a mapping function, which should work in constant time per element (using some kind of hash table). Apart from putting the distinct values themselves in order, this approach runs in O(n), which is essentially optimal when the number of distinct values is small.
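
One way to realize this in C++ (my variant, not necessarily what the answerer had in mind) is to let an ordered map play the role of the mapping function, so the distinct keys come back already sorted:

    #include <map>
    #include <vector>

    // The map is the "mapping function" from a value to its frequency slot.
    // An ordered map keeps the k distinct keys sorted, so counting costs
    // O(n log k) and the expansion pass costs O(n).
    void countingSortMapped(std::vector<int>& a) {
        std::map<int, int> freq;
        for (int x : a) ++freq[x];           // count occurrences per distinct value

        int i = 0;
        for (const auto& kv : freq)          // keys are visited in sorted order
            for (int c = kv.second; c > 0; --c) a[i++] = kv.first;
    }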

answered Oct 18 '22 by malejpavouk