
Sorting algorithms for data of known statistical distribution?

It just occurred to me, if you know something about the distribution (in the statistical sense) of the data to sort, the performance of a sorting algorithm might benefit if you take that information into account.

So my question is, are there any sorting algorithms that take into account that kind of information? How good are they?

An example to clarify: if you know the distribution of your data to be Gaussian, you could estimate the mean and variance on the fly as you process the data. This would give you an estimate of the final position of each number, which you could use to place elements close to their final position.
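
To make the idea concrete, here is a rough sketch (Python, purely illustrative; Welford's online update and math.erf are just convenient choices, not part of the question):

    import math

    def predicted_positions(data):
        """Estimate where each value will land in the sorted output,
        assuming the data is roughly Gaussian (illustrative only)."""
        n = len(data)
        mean, m2 = 0.0, 0.0
        positions = []
        for i, x in enumerate(data, start=1):
            # Welford's online update of the running mean and variance
            delta = x - mean
            mean += delta / i
            m2 += delta * (x - mean)
            std = math.sqrt(m2 / i) if i > 1 and m2 > 0 else 1.0
            # The Gaussian CDF estimates the fraction of values below x,
            # i.e. x's approximate final rank in the sorted array.
            cdf = 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2))))
            positions.append(min(n - 1, int(cdf * n)))
        return positions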

I'm pretty surprised the answer isn't a wiki link to a thorough page discussing this issue. Isn't this a very common case (the Gaussian case, for example)?

I'm adding a bounty to this question, because I'm looking for definite answers with sources, not speculation. Something like "in the case of Gaussian-distributed data, XYZ algorithm is the fastest on average, as was proved by Smith et al. [1]". However, any additional information is welcome.

asked May 29 '11 by static_rtti



4 Answers

If the data you are sorting has a known distribution, I would use a Bucket Sort algorithm. You could add some extra logic to it so that you calculate the size and/or positions of the various buckets based upon properties of the distribution (e.g. for Gaussian data, you might place a bucket every sigma/k away from the mean, where sigma is the standard deviation of the distribution).

By having a known distribution and modifying the standard Bucket Sort algorithm in this way, you would probably get the Histogram Sort algorithm or something close to it. Of course, your algorithm would be computationally faster than the Histogram Sort algorithm because there would probably be no need for the first pass (described in the link), since you already know the distribution.
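
For illustration, here is a minimal sketch of that idea in Python. It uses equal-probability buckets derived from the Gaussian CDF (my reading of the histogram-sort variant, rather than the exact sigma/k spacing above), and assumes the mean and sigma of the distribution are known:

    import math

    def gaussian_bucket_sort(data, mean, sigma):
        n = len(data)
        num_buckets = max(1, n)
        buckets = [[] for _ in range(num_buckets)]
        for x in data:
            # The Gaussian CDF maps x into (0, 1); under the assumed
            # distribution each bucket then expects about one element.
            cdf = 0.5 * (1.0 + math.erf((x - mean) / (sigma * math.sqrt(2))))
            idx = min(num_buckets - 1, int(cdf * num_buckets))
            buckets[idx].append(x)
        # "Cleanup" pass: sort each (hopefully tiny) bucket and concatenate.
        result = []
        for b in buckets:
            result.extend(sorted(b))
        return result

Because the CDF is monotone, the buckets are already in the right relative order, so concatenating the individually sorted buckets yields the sorted array.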

Edit: given the new criteria in your question (though my previous answer concerning Histogram Sort links to the respectable NIST site and contains performance information), here is a peer-reviewed paper from the International Conference on Parallel Processing:

Adaptive Data Partition for Sorting Using Probability Distribution

The authors claim this algorithm has better performance (up to 30% better) than the popular Quick-Sort Algorithm.

answered by Jason Moore


Sounds like you might want to read Self-Improving Algorithms: they achieve an eventual optimal expected running time for arbitrary input distributions.

We give such self-improving algorithms for two problems: (i) sorting a sequence of numbers and (ii) computing the Delaunay triangulation of a planar point set. Both algorithms achieve optimal expected limiting complexity. The algorithms begin with a training phase during which they collect information about the input distribution, followed by a stationary regime in which the algorithms settle to their optimized incarnations.

If you already know your input distribution is approximately Gaussian, then perhaps another approach would be more efficient in terms of space complexity, but in terms of expected running time this is a rather wonderful result.

answered by Jason Davies


If you know the distribution of the data source, you can build a good hash function. Knowing the distribution well, the hash function may prove to be a perfect hash function, or close to perfect for many input vectors.

Such a function would divide an input of size n into n bins, such that the smallest item maps to the first bin and the largest item maps to the last bin. When the hash is perfect, we achieve a sort just by inserting all the items into the bins.

Inserting all the items into a hash table and then extracting them in order will be O(n) when the hash is perfect (assuming the hash function calculation costs O(1) and the underlying hash data structure's operations are O(1)).

I would use an array of Fibonacci heaps to implement the hash table.

For input vectors for which the hash function isn't perfect (but is still close to perfect), it would still be much better than O(n log n). When the hash is perfect, it would be O(n). I'm not sure how to calculate the average complexity, but if forced to, I would bet on O(n log log n).
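
To make the idea concrete, here is a sketch using plain Python lists instead of Fibonacci heaps, and assuming the data is known to be uniform on [lo, hi] so that linear interpolation gives the obvious order-preserving hash (hash_sort_uniform is just an illustrative name):

    def hash_sort_uniform(data, lo, hi):
        n = len(data)
        if n == 0 or hi == lo:
            return list(data)
        bins = [[] for _ in range(n)]
        for x in data:
            # Order-preserving "hash": the smallest values land in the
            # first bin, the largest in the last bin.
            idx = min(n - 1, max(0, int(n * (x - lo) / (hi - lo))))
            bins[idx].append(x)
        out = []
        for b in bins:
            # With a perfect hash every bin holds at most one item and this
            # loop is O(n); otherwise each small bin gets a cheap cleanup sort.
            out.extend(sorted(b) if len(b) > 1 else b)
        return out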

answered by 13 revs


Computer sorting algorithms can be classified into two categories: comparison-based sorting and non-comparison-based sorting. Comparison-based sorting has an Ω(n log n) lower bound, and in the worst case the sorting time can rise to O(n²). In recent years, some improved algorithms have been proposed to speed up comparison-based sorting, such as advanced quicksort tuned to data distribution characteristics. However, the average sorting time for these algorithms is still Ω(n log n), and only in the best case can it reach O(n).

Non-comparison-based sorting, such as counting sort, bucket sort and radix sort, depends mainly on key and address calculation. When the key values are finite, ranging from 1 to m, the computational complexity of non-comparison-based sorting is O(m + n). In particular, when m = O(n), the sorting time can reach O(n). However, when m = n², n³, …, a linear upper bound on the sorting time cannot be obtained.

Among non-comparison-based sorts, bucket sort distributes a group of records with similar keys into the appropriate "bucket", then another sorting algorithm is applied to the records in each bucket. Partitioning the records into m buckets is not very time consuming, and each bucket holds only a few records, so the "cleanup" sort applied within each bucket runs very fast. Therefore, bucket sort has the potential to asymptotically beat Ω(n log n) algorithms. Obviously, how uniformly the records are distributed into the buckets plays a critical role. Hence, what you need is a method to construct a hash function from the data distribution, which is used to uniformly distribute the n records into n buckets based on the key of each record. With such a hash function, the sorting time of the resulting bucket sort reaches O(n) under any circumstance.

check this paper: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5170434&tag=1
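
For reference, the O(m + n) non-comparison case mentioned above (integer keys ranging from 1 to m) is essentially counting sort; a minimal sketch:

    def counting_sort(keys, m):
        # counts[k] holds the number of occurrences of key k (keys in 1..m)
        counts = [0] * (m + 1)
        for k in keys:
            counts[k] += 1
        out = []
        for k in range(1, m + 1):
            out.extend([k] * counts[k])
        return out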

answered by AhmadAssaf