
Why does Java 6 Arrays#sort(Object[]) change from mergesort to insertionsort for small arrays?

Java 6's mergesort implementation in Arrays.java switches to an insertion sort when the length of the (sub)array is below a threshold, which is hard-coded to 7. Because the algorithm is recursive, this happens many times for a large array. The canonical merge sort does not do this; it keeps splitting until each sub-list holds a single element.
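To make the behaviour described above concrete, here is a minimal sketch of a merge sort with an insertion-sort cutoff. It is not the JDK source: the class name `HybridMergeSort` and its structure are made up for illustration, and only the threshold value 7 is taken from the question.

```java
import java.util.Arrays;

// Illustrative sketch only; names and structure are hypothetical, not the JDK code.
class HybridMergeSort {
    // Value quoted in the question for Java 6's INSERTIONSORT_THRESHOLD.
    static final int INSERTIONSORT_THRESHOLD = 7;

    static <T extends Comparable<T>> void sort(T[] a) {
        T[] aux = a.clone();              // scratch space for merging
        sort(a, aux, 0, a.length);
    }

    // Sorts the half-open range a[low, high).
    private static <T extends Comparable<T>> void sort(T[] a, T[] aux, int low, int high) {
        int length = high - low;
        if (length < INSERTIONSORT_THRESHOLD) {
            // Small range: plain insertion sort, no further recursion.
            for (int i = low + 1; i < high; i++) {
                for (int j = i; j > low && a[j - 1].compareTo(a[j]) > 0; j--) {
                    T tmp = a[j];
                    a[j] = a[j - 1];
                    a[j - 1] = tmp;
                }
            }
            return;
        }
        int mid = (low + high) >>> 1;
        sort(a, aux, low, mid);
        sort(a, aux, mid, high);

        // Merge the two sorted halves back into a, reading from a copy in aux.
        System.arraycopy(a, low, aux, low, length);
        int p = low, q = mid;
        for (int i = low; i < high; i++) {
            if (q >= high || (p < mid && aux[p].compareTo(aux[q]) <= 0)) {
                a[i] = aux[p++];          // left element first on ties keeps the sort stable
            } else {
                a[i] = aux[q++];
            }
        }
    }

    public static void main(String[] args) {
        Integer[] data = {9, 3, 7, 1, 8, 2, 6, 5, 4, 0, 11, 10};
        sort(data);
        System.out.println(Arrays.toString(data)); // [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
    }
}
```

With 12 elements the top-level call splits once into two ranges of 6, and both of those fall below the threshold, so the recursion stops there.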

Is this an optimisation? If so, how is it supposed to help? And why 7? The insertion sort (even of <=7 things) dramatically increases the number of comparisons required to sort a large array, so it adds cost to a sort where compareTo() calls are slow.

[Figure: array size (x-axis) vs. number of comparisons (y-axis) for different values of INSERTIONSORT_THRESHOLD]
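A measurement like the one plotted can be reproduced roughly as follows. This is only a sketch under assumptions: it sorts `int` arrays with a hand-written hybrid merge sort rather than calling the JDK code, and the class `ThresholdExperiment`, the input size and the list of thresholds are all made up for illustration.

```java
import java.util.Random;

// Hypothetical experiment harness; not the JDK implementation.
class ThresholdExperiment {
    static long comparisons;   // incremented on every element comparison

    static boolean greater(int x, int y) {
        comparisons++;
        return x > y;
    }

    // Sorts a in place and returns the number of comparisons made.
    static long sortCounting(int[] a, int threshold) {
        comparisons = 0;
        sort(a, new int[a.length], 0, a.length, threshold);
        return comparisons;
    }

    private static void sort(int[] a, int[] aux, int low, int high, int threshold) {
        // Ranges shorter than 2 are already sorted; the max() keeps a
        // threshold of 1 (pure merge sort) from recursing forever.
        if (high - low < Math.max(threshold, 2)) {
            for (int i = low + 1; i < high; i++) {
                for (int j = i; j > low && greater(a[j - 1], a[j]); j--) {
                    int t = a[j];
                    a[j] = a[j - 1];
                    a[j - 1] = t;
                }
            }
            return;
        }
        int mid = (low + high) >>> 1;
        sort(a, aux, low, mid, threshold);
        sort(a, aux, mid, high, threshold);
        System.arraycopy(a, low, aux, low, high - low);
        int p = low, q = mid;
        for (int i = low; i < high; i++) {
            if (p >= mid) {
                a[i] = aux[q++];
            } else if (q >= high) {
                a[i] = aux[p++];
            } else if (greater(aux[p], aux[q])) {
                a[i] = aux[q++];
            } else {
                a[i] = aux[p++];
            }
        }
    }

    public static void main(String[] args) {
        int[] input = new Random(42).ints(1000, 0, 1_000_000).toArray();
        for (int threshold : new int[] {1, 4, 7, 16, 32}) {
            long count = sortCounting(input.clone(), threshold);
            System.out.println("threshold=" + threshold + " comparisons=" + count);
        }
    }
}
```

A threshold of 1 gives the canonical merge sort, so the first line of output is the baseline the other thresholds can be compared against.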

Matthew Gilliard asked Jul 11 '11 12:07



3 Answers

Yes, this is intentional. While the Big-O complexity of mergesort is lower than that of quadratic sorts such as insertion sort, the operations it performs are more complex, so on small inputs it is slower.

Consider sorting an array of length 8. Merge sort makes ~14 recursive calls to itself in addition to 7 merge operations. Each recursive call contributes some non-trivial overhead to the run-time. Each merge operation involves a loop where index variables must be initialized, incremented, and compared, temporary arrays must be copied, etc. All in all, you can expect well over 300 "simple" operations.

On the other hand, insertion sort is inherently simple and uses about 8^2=64 operations which is much faster.

Think about it this way. When you sort a list of 10 numbers by hand, do you use merge sort? No, because your brain is much better at doing simple things like insertion sort. However, if I gave you a year to sort a list of 100,000 numbers, you might be more inclined to merge sort it.

As for the magic number 7, it is empirically derived to be optimal.

EDIT: In a standard insertion sort of 8 elements, the worst case takes 28 comparisons (8*7/2). A canonical merge sort needs at most 17 (n*log2(n) = 24 is a loose upper estimate). Adding in the overhead from the method calls and the complexity of the operations, insertion sort should still be faster. Additionally, in the average case insertion sort makes noticeably fewer comparisons than in its worst case.
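The comparison counts for 8 elements can be checked by brute force. The sketch below runs a textbook insertion sort and a textbook top-down merge sort over all 8! permutations and records the worst case and the total for each. The class `SmallArrayComparisons` and its helpers are hypothetical, and this counts comparisons only; it does not capture the call and copy overhead the answer is talking about.

```java
// Hypothetical brute-force check; counts element comparisons only.
class SmallArrayComparisons {
    static long comparisons;   // incremented on every element comparison

    static boolean greater(int x, int y) {
        comparisons++;
        return x > y;
    }

    static void insertionSort(int[] a) {
        for (int i = 1; i < a.length; i++) {
            for (int j = i; j > 0 && greater(a[j - 1], a[j]); j--) {
                int t = a[j];
                a[j] = a[j - 1];
                a[j - 1] = t;
            }
        }
    }

    // Canonical top-down merge sort of a[low, high), down to single elements.
    static void mergeSort(int[] a, int low, int high) {
        if (high - low < 2) return;
        int mid = (low + high) >>> 1;
        mergeSort(a, low, mid);
        mergeSort(a, mid, high);
        int[] merged = new int[high - low];
        int p = low, q = mid, k = 0;
        while (p < mid && q < high) {
            merged[k++] = greater(a[p], a[q]) ? a[q++] : a[p++];
        }
        while (p < mid) merged[k++] = a[p++];
        while (q < high) merged[k++] = a[q++];
        System.arraycopy(merged, 0, a, low, merged.length);
    }

    // Returns {worstInsertion, worstMerge, totalInsertion, totalMerge, permutationCount}.
    static long[] measure(int n) {
        int[] perm = new int[n];
        for (int i = 0; i < n; i++) perm[i] = i;
        long[] stats = new long[5];
        permute(perm, 0, stats);
        return stats;
    }

    // Visits every permutation of perm[k..] by recursive swapping.
    private static void permute(int[] perm, int k, long[] stats) {
        if (k == perm.length) {
            comparisons = 0;
            insertionSort(perm.clone());
            stats[0] = Math.max(stats[0], comparisons);
            stats[2] += comparisons;
            comparisons = 0;
            mergeSort(perm.clone(), 0, perm.length);
            stats[1] = Math.max(stats[1], comparisons);
            stats[3] += comparisons;
            stats[4]++;
            return;
        }
        for (int i = k; i < perm.length; i++) {
            int t = perm[k]; perm[k] = perm[i]; perm[i] = t;
            permute(perm, k + 1, stats);
            t = perm[k]; perm[k] = perm[i]; perm[i] = t;
        }
    }

    public static void main(String[] args) {
        long[] s = measure(8);
        System.out.println("worst insertion sort: " + s[0]); // 28
        System.out.println("worst merge sort:     " + s[1]); // 17
        System.out.printf("average insertion sort: %.2f%n", (double) s[2] / s[4]);
        System.out.printf("average merge sort:     %.2f%n", (double) s[3] / s[4]);
    }
}
```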

tskuzzy answered Oct 18 '22 20:10



In the worst case, insertion sort makes n(n-1)/2 comparisons, and merge sort makes roughly n*log2(n).

Considering this:

  1. For an array of length 5 => insertion sort = 10 and merge sort = 11.609
  2. For an array of length 6 => insertion sort = 15 and merge sort = 15.509
  3. For an array of length 7 => insertion sort = 21 and merge sort = 19.651
  4. For an array of length 8 => insertion sort = 28 and merge sort = 24

From the above data it is clear that up to length 6 insertion sort needs fewer comparisons, and from length 7 onwards merge sort needs fewer.

That explains why 7 is used.
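The figures in the list can be reproduced from the two formulas. A minimal sketch follows; the class and method names are made up for illustration, and n*log2(n) is only an estimate of merge sort's comparison count, not an exact value.

```java
// Hypothetical helper that evaluates the two formulas used in this answer.
class ComparisonEstimates {
    // Worst-case comparisons for insertion sort: n(n-1)/2.
    static int insertionWorstCase(int n) {
        return n * (n - 1) / 2;
    }

    // The n*log2(n) estimate used for merge sort.
    static double mergeEstimate(int n) {
        return n * (Math.log(n) / Math.log(2));
    }

    public static void main(String[] args) {
        for (int n = 5; n <= 8; n++) {
            System.out.printf("n=%d insertion=%d merge=%.3f%n",
                    n, insertionWorstCase(n), mergeEstimate(n));
        }
    }
}
```

The printed merge values are rounded to three decimals, so they can differ in the last digit from the truncated values in the list.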

user1289117 answered Oct 18 '22 21:10



My understanding is that this is an empirically derived value, where the time required for an insertion sort is actually lower, despite a (possible) higher number of comparisons required. This is so because near the end of a mergesort, the data is likely to be almost sorted, which makes insertion sort perform well.
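The effect of presortedness on insertion sort is easy to see by counting comparisons. The sketch below is illustrative only, with a made-up class name and inputs. Whether the sub-arrays of a merge sort really are almost sorted depends on the input; the code only shows that insertion sort is cheap when they are.

```java
// Hypothetical demo: insertion sort comparison counts on two 7-element inputs.
class NearlySortedInsertion {
    // Insertion sort that returns how many element comparisons it made.
    static int sortCounting(int[] a) {
        int comparisons = 0;
        for (int i = 1; i < a.length; i++) {
            int j = i;
            while (j > 0) {
                comparisons++;
                if (a[j - 1] <= a[j]) break;   // element is already in place
                int t = a[j];
                a[j] = a[j - 1];
                a[j - 1] = t;
                j--;
            }
        }
        return comparisons;
    }

    public static void main(String[] args) {
        // One adjacent pair out of order.
        System.out.println(sortCounting(new int[] {1, 2, 4, 3, 5, 6, 7})); // prints 7
        // Fully reversed: the worst case, 7*6/2.
        System.out.println(sortCounting(new int[] {7, 6, 5, 4, 3, 2, 1})); // prints 21
    }
}
```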

dlev answered Oct 18 '22 19:10
