I have a Perl script that goes through a couple of gigs worth of files and generates a report.
In order to calculate a percentile I am doing the following:
my @values;
while (my $line = <INPUTFILE>){
    .....
    push(@values, $line);
}
# Sort numerically
@values = sort {$a <=> $b} @values;
# Print the 95th percentile
print $values[sprintf("%.0f", 0.95 * $#values)];
This obviously saves all the values up front in an array and then calculates the percentile, which can be heavy on memory (assuming millions of values). Is there a more memory-efficient way of doing this?
You can process the file twice: the first pass only counts the number of lines ($.). From that number, you can compute the size of a sliding window that keeps only the highest numbers needed to find the percentile (for percentiles < 50, you should invert the logic and keep the lowest numbers instead).
#!/usr/bin/perl
use warnings;
use strict;

my $percentile = 95;
my $file = shift;
open my $IN, '<', $file or die $!;

1 while <$IN>;               # Just count the number of lines.
my $line_count = $.;
seek $IN, 0, 0;              # Rewind.

# Calculate the size of the sliding window: the top (100 - percentile)% plus one.
my $remember_count = 1 + (100 - $percentile) * $line_count / 100;

# Initialize the window with the first lines, sorted ascending.
my @window = sort { $a <=> $b }
             map scalar <$IN>,
             1 .. $remember_count;
chomp @window;

while (<$IN>) {
    chomp;
    next if $_ < $window[0];       # Too small to belong in the window.
    shift @window;                 # Drop the current smallest value ...
    my $i = 0;
    $i++ while $i <= $#window and $window[$i] <= $_;
    splice @window, $i, 0, $_;     # ... and insert the new value in sorted order.
}
print "$window[0]\n";              # Smallest kept value is the percentile.