
Perl: Find maximum value of a hash and compute averages

After a big break of ~6 months I am back in the world of Perl and Bioinformatics, interning under a different scientist. But the very first assignment is unlike any I had encountered last time, so while I have made some progress, I haven't been able to tackle the problem in its entirety. I am also trying to revise whatever I learnt last time as fast as possible, because I completely lost touch with programming these last 6 months. The dataset looks like the following:

NR_046018   DDX11L1     ,   0   0   1   1   1   1   1   1   1   1   0   0   0   0   1.44    2.72    3.84    4.92    5.6 6.64    7.08    9.12    9.56    8.28    7.16    6.08    5.4 4.36    3.92    1.88    0   0   0.76    1   1   1   1.2 2   2   2   1.72    2   2   2   1.8 1   1.88    2.4 3   3.36    5   6   6   6.72    6.12    5.6 5.44    5.56    5   4.04    5   4.28    4   4   3.08    2.08    1.68    1.96    1.44    3   3.68    4   4.16    5   4.32    4.8 6.16    6   6.28    6.92    7.84    7   7.32    7.2 5.96    5   4.52    4.08    3   3   4.04    4.12    4.44    4   3.52    3.4 4   4   2.64    1.88    1   1   1   0.64    1   1   1.24    2   2.92    3   3   2.96    2   2   2.56    2   1.08    2.12    3   3   3   3   2.6 3   4.64    3.88    3.72    4   4   4.96    4.6 4   2.36    2   1.28    1   1   0.04    0   0.24    1.08    2.68    3.84    4.12    5.72    6   6   5.76    4.92    3.32    3.12    2.88    2.08    2   2   2   2   2   1.44    2.92    3.04    4.28    5.8 7.8 9.48    10.52   13.04   12.08   11.6    11.72   11  9.2 7.52    7.12    7.08    7.08    8.32    7   6.6 7.6 8.04    8.36    6.72    7.88    7.72    8.4 9.24    8.88    8.96    9.88    10.08   9.24    9.28    10.16   11.04   10.52   10  8.56    8   7.8 7.72    6.44    4.32    4   4   3.72    3.68    3.68    3.28    5.56    7.36    9.48    10  10.52   11  12.16   11.96   9.44    8.64    7.52    7   6.48    6   5   5.12    6.28    6   5.52    6   6.68    6.08    7.52    8.16    7.72    8.52    8.56    9.2 9.16    8.92    7.44    6   5   3.48    2.92    2.16    2   2   1.2 1   1   1   1.24    1.64    1   1   1.96    2   2   2   1.76    1   1   1   0.52    1.76    3.64    5.12    6   6   6   6   5.52    4.24    2.36    0.88    0   0   0.68    1   1   1   1   1   1   1   0.32    0   0   1   1   1.44    2.44    3.68    5.4 6.88    7   6   6.52    6.76    6.56    5.32    3.6 2.92    3   3.72    3.96    3.8 3   3   3   2.2 2.4 2.28    1.52    1   1   1   1.72    2   1.6 1   1   1   1   1   0.28    0.92    
2   2   2.72    3.64    4   4.84    5   4.08    3   3   2.68    2.36    2   1.16    1   1   2   4.92    4.6 4   4   4   4   4.32    4   1.08    1   1.52    2   2   2   1.68    1   1   1.32    1.48    1   1   1.52    2   2   2   1.68    1   1   1.88    1.48    1   1   1   1   1   1   0.12    0.4 1   1   1.2 3.88    4   5   5   4.6 4   4   3.8 2.08    2   1   1   1.44    2.4 3
NR_047520   LOC643837   ,   3   2.2 0.2 0   0   0.28    1   1   1   1   2.2 4.8 5   5.32    5   5   5   5   3.8 1.2 1   0.4 0   0   0   0   0   1   1   1   1   1   1   1   1.56    1   1   1   1   1   1   1   0.44    0.68    1   1.52    3   3.6 4.96    6.8 9   8.32    8.72    8.48    7   7.4 8.8 7.92    7.12    8.84    8.56    9.4 10.2    10  7.24    6.44    6.76    6.16    5.72    4.96    4.8 5.16    6   5.84    4.12    3   3   2.64    2.56    3.08    3   4.16    5   6.72    7   7.16    7.44    5.76    5   4.56    4   3.68    5   5.4 5.52    6   6   5.28    5   3.6 2   2.08    1.48    1   2   2   2   2   2   1.36    1   1   0   0   0.68    1   1   1   1   1   1   1   0.32    0   0   0   1.16    2   2   2   2   2.88    3   3   1.84    1   2   2   2.04    2.12    2   2   2   2   1   1.28    1.96    1.36    2.76    3   3   3   3   2.72    2   1.64    0.76    1   1.36    2   2   2   2   2   1.48    1   0.64    0   0.08    1   1   1.08    2   2   2   2   2.68    2   2   2.16    3.4 4   4   4.2 4.24    4   5.68    6.52    4.6 4   4   3.8 3.8 4   3.12    2.24    2.6 3   4   4   3.2 3   2.2 2   1.4 1.84    1.24    2   2   2   2   2   2   1.16    0.76    0   0   0   0   0   0   0   0.36    1   1.68    2   2   2.92    5.4 6.76    7.64    7   6.88    7   7.36    7.92    6.24    5.92    7.04    9.52    11.52   12.88   14.8    16.36   19.88   22.24   20  19.36   16.92   15.24   13.84   10.88   8.24    5.08    4.96    3.12    3   2.88    2   2.8 2.96    4   4.44    5   6   6   6   5.12    3.28    2   1.56    1   0.08    1.68    2   2   2.84    3   3   3.8 3.92    2.32    2   2.2 2.16    2   2   1.2 1   1   1   0.8 0   0   0   0.72    2.88    3   3   3   3   3   3   2.28    0.12    0   0.52    1   1   1   1   1.44    2   2   1.48    1   1   1   1.56    1.56    1   1   1   1   1   1   0.44    0.8 1.48    3   3   3   3   3   3.56    3.2 2.76    2   2   2   2   2.68    2.44    2   1.76    1   1.4 2   2   1.56    2   2   2   2   2.04    2   2   1.76    1   1   1   1   0.56    0   0   
0   0   0   0   0   0   0.72    1.52    2   2   2   2   2   2   1.28    0.48                                                                            

1. What is needed

  1. For each row in the data file, find the maximum value from the range of numbers.
  2. Once the maximum has been found for every row, compute the average of those maxima.

2. Strategy I was thinking of

  1. Separate the non-numerical part from the numerical part and store it as the "keys" of a hash.
  2. Put the numerical part into the "values" of a hash.
  3. Assign the "values" into array @values
  4. Use the max function from List::Util (use List::Util qw(max)) to find the maximum value in the array.
  5. Store these maximum values in another array and compute the average from that array.

3. Code written so far

use warnings;
use List::Util qw(max);

#Input filename
$file = 'test1.data';

#Open file
open I, '<', $file or die;

#Separate data into keys and values, based on ','
chop (%hash = map { split /\s*,\s*/,$_,2 } grep (!/^$/,<I>));
print "$_ => $hash{$_}\n" for keys %hash; #Code is working fine till here

#Create a values array
@values = values %hash;
foreach $value(@values){
 print "The values are : ", $value,"\n";
}

4. The Problem

Beyond this, I am not able to figure out how to add each "individual" array element into a new array so that I may use the max function.

What I mean is that for example, the first array element in @values contains data like 0 0 1 1 3 4.4. The second array element might have data like 3 2.2 0.28 1 1 4.8. So I need to put each of these array elements into a new array, each element going into a different array so that I may be able to use the max function.

5. Points to Note

  1. Most of the rows contain 400 numbers, some have a little less than that, but never more than 400.

  2. There are a total of 23,558 rows.

  3. File is a .txt file and all the numbers in each row are tab delimited.

I would be grateful to anyone kind enough to point me in the right direction, or to suggest better code for the problem described in section 1.

Neal asked Dec 19 '12 09:12



1 Answer

If I understand your problem correctly you're making it overly complicated:

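#!/usr/bin/env perl
use strict;
use warnings;
use List::Util qw(max);

# Input filename
my $file = 'test1.data';

# Open the file with a lexical filehandle
open my $fh, '<', $file or die "Unable to open $file: $!\n";

my ($total, $num) = (0, 0);

while (<$fh>) {
    next unless /\S/;                       # skip blank lines
    my @values = split;                     # split $_ on whitespace
    my $max = max(@values[3 .. $#values]);  # fields 0-2 are ID, name and ','
    $total += $max;
    $num++;
}
close $fh;

die "No data rows found in $file\n" unless $num;
my $average_max = $total / $num;
print "Average maximum: $average_max\n";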

Just make one pass over the file: split each line into an array and feed everything from index 3 onward to max (indices 0 to 2 hold the accession, the gene name, and the comma). Add $max to $total for each line, increment a counter ($num), and compute the average maximum from those two once the loop has finished.
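If you also want to keep the per-row maxima (closer to the hash idea in the question), the same split-and-max step can be wrapped in a small sub. This is only a sketch: row_max, %max_for, and the shortened rows in @lines are hypothetical names and data made up for illustration, and it assumes every row starts with the accession, the gene name, and a lone comma as in the sample.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use List::Util qw(max sum);

# Return the largest numeric field of one row, or undef if the row has none.
# Assumes the layout from the question: accession, gene name, a lone comma,
# then the numbers, all separated by whitespace.
sub row_max {
    my ($line) = @_;
    my @fields = split ' ', $line;
    return if @fields < 4;                  # blank line or no numbers
    return max(@fields[3 .. $#fields]);
}

# Hypothetical, shortened rows standing in for test1.data.
my @lines = (
    "NR_046018\tDDX11L1\t,\t0\t0\t1\t1\t3\t4.4",
    "",
    "NR_047520\tLOC643837\t,\t3\t2.2\t0.28\t1\t1\t4.8",
);

my %max_for;                                # accession => row maximum
for my $line (@lines) {
    my $max = row_max($line);
    next unless defined $max;               # skip rows without numbers
    my ($id) = split ' ', $line;            # first field is the accession
    $max_for{$id} = $max;
}

# Average of the per-row maxima (assumes at least one data row was found).
my $average_max = sum(values %max_for) / scalar keys %max_for;

printf "%s => %s\n", $_, $max_for{$_} for sort keys %max_for;
print "Average maximum: $average_max\n";
```

Keeping the maxima in a hash costs one entry per row, which is negligible for about 23,558 rows, and lets you look up the maximum for a given accession afterwards.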

You should also always enable use strict and use lexical filehandles (open my $fh, ...) instead of bareword ones such as I.
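To illustrate both habits, here is a minimal sketch. The count_lines sub and the in-memory $data string are hypothetical and only there so the example runs without test1.data; passing a reference to a scalar as the third argument of open makes Perl read from that string as if it were a file.

```perl
#!/usr/bin/env perl
use strict;      # undeclared variables (e.g. a mistyped name) fail at compile time
use warnings;

# In-memory stand-in for the data file, so the sketch runs on its own.
my $data = "a\t1\nb\t2\n";

sub count_lines {
    my ($source) = @_;
    # Lexical filehandle: visible only inside this sub and closed
    # automatically when $fh goes out of scope.
    open my $fh, '<', $source or die "Unable to open input: $!\n";
    my $count = 0;
    $count++ while <$fh>;
    return $count;
}

print count_lines(\$data), "\n";
```

A bareword handle like I is a package global, so any other code that opens I would silently clobber it; a lexical $fh cannot collide that way.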

flesk answered Oct 19 '22 03:10