I have to sort a number of integers, which can have values between 30.000.000 and 350.000.000. There will be between 0 and 65.535 integers, with the average count being 20.000. RAM usage is irrelevant and speed only is important.
Later on i will also have to split them into groups, with the divide always being set whenever the gap between two of these values is >65.535, which is what i need the algorithm for.
If it makes any difference, the algorithm will be used in a Perl script.
Edit: After thinking it over and reading the answers i've come to realize something: I don't actually care about the data itself. As i really only want to find the start and end values of groups with small gaps, the sorting only needs to create buckets and can discard the data.
Edit2: After some testing and trying out the answers provided, the fastest way i found was this:
my @sort = sort {$a <=> $b} @item_offsets;
my @buckets;
my $start = shift @sort;
push @buckets, [$start,$start];
for my $item ( @sort ) {
if ( $item < $buckets[$#buckets][1]+$gap ) {
$buckets[$#buckets][1] = $item;
}
else {
push @buckets, [$item,$item];
}
}
say $#buckets;
I'd just make an array of buckets before running the algorithm, one for each group of 65536 consecutive values. The buckets will contain a min and max value of their contents, but won't store the contents themselves. After running the algorithm, do a single pass over the buckets. If there are two consecutive non-empty buckets with min(bucket2)-max(bucket1) < 65536, combine them. Combining will not happen until the algorithm finishes running. Discard any empty buckets. This algorithm is linear time.
Take note of Bucket Sort.
It's unlikely that you'll be able to write a sort algorithm in Perl that will perform better than Perl's builtin sort
function:
@numbers = sort {$a <=> $b} @numbers;
You can experiment with the sort pragma to see if a particular algorithm is better:
use sort '_quicksort';
use sort '_mergesort';
Since your cutpoints will vary depending on the data distribution, I think you need to sort the entire list first then loop over it to do the cutting.
my $prev = shift @numbers; # already sorted
my @group = [$prev];
my $i = 0;
foreach my $n (@numbers) {
$i++ if ($n - $prev > 65535);
push @{$group[$i]}, $n;
$prev = $n;
}
I'd use a radix sort, since you need to group the output.
I was just going to say radix sort, http://en.wikipedia.org/wiki/Radix_sort however that could be a bit above what you were looking to implement, Introsort is generally the accepted sorting solution for data http://en.wikipedia.org/wiki/Introsort, it's a variation of quicksort that switches to heapsort when it reaches smaller sets as it's faster on smaller sets than quicksort.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With