I am trying to perform some composition-based filtering on a large collection of strings (protein sequences).
I wrote a group of three subroutines in order to take care of it, but I'm running into trouble in two ways - one minor, one major. The minor trouble is that when I use List::MoreUtils 'pairwise' I get warnings about using $a
and $b
only once and them being uninitialized. But I believe I'm calling this method properly (based on CPAN's entry for it and some examples from the web).
The major trouble is an error "Can't use string ("17/32") as HASH ref while "strict refs" in use..."
It seems like this can only happen if the foreach
loop in &comp
is giving the hash values as a string instead of evaluating the division operation. I'm sure I've made a rookie mistake, but can't find the answer on the web. The first time I even looked at perl code was last Wednesday...
use List::Util;
use List::MoreUtils;
my @alphabet = (
'A', 'R', 'N', 'D', 'C', 'Q', 'E', 'G', 'H', 'I',
'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V'
);
my $gapchr = '-';
# Takes a sequence and returns letter => occurrence count pairs as hash.
sub getcounts {
my %counts = ();
foreach my $chr (@alphabet) {
$counts{$chr} = ( $_[0] =~ tr/$chr/$chr/ );
}
$counts{'gap'} = ( $_[0] =~ tr/$gapchr/$gapchr/ );
return %counts;
}
# Takes a sequence and returns letter => fractional composition pairs as a hash.
sub comp {
my %comp = getcounts( $_[0] );
foreach my $chr (@alphabet) {
$comp{$chr} = $comp{$chr} / ( length( $_[0] ) - $comp{'gap'} );
}
return %comp;
}
# Takes two sequences and returns a measure of the composition difference between them, as a scalar.
# Originally all on one line but it was unreadable.
sub dcomp {
my @dcomp = pairwise { $a - $b } @{ values( %{ comp( $_[0] ) } ) }, @{ values( %{ comp( $_[1] ) } ) };
@dcomp = apply { $_ ** 2 } @dcomp;
my $dcomp = sqrt( sum( 0, @dcomp ) ) / 20;
return $dcomp;
}
Much appreciation for any answers or advice!
There are a few bugs in your code. First, note from perldoc perlop:
Because the transliteration table is built at compile time, neither the
SEARCHLIST
nor theREPLACEMENTLIST
are subjected to double quote interpolation.
So your counting method is incorrect. I also believe you are misusing pairwise
. It is hard to evaluate what constitutes correct usage because you do not give examples of what output you should get with some simple inputs.
In any case, I would re-write this script as (there are some debugging statements sprinkled in):
#!/usr/bin/perl
use List::AllUtils qw( sum );
use YAML;
our ($a, $b);
my @alphabet = ('A' .. 'Z');
my $gap = '-';
my $seq1 = 'ABCD-EFGH--MNOP';
my $seq2 = 'EFGH-ZZZH-KLMN';
print composition_difference($seq1, $seq2);
sub getcounts {
my ($seq) = @_;
my %counts;
my $pattern = join '|', @alphabet, $gap;
$counts{$1} ++ while $seq =~ /($pattern)/g;
warn Dump \%counts;
return \%counts;
}
sub fractional_composition_pairs {
my ($seq) = @_;
my $comp = getcounts( $seq );
my $denom = length $seq - $comp->{$gap};
$comp->{$_} /= $denom for @alphabet;
warn Dump $comp;
return $comp;
}
sub composition_difference {
# I think your use of pairwise in the original script
# is very buggy unless every sequence always contains
# all the letters in the alphabet and the gap character.
# Is the gap character supposed to factor in the computations here?
my ($comp1, $comp2) = map { fractional_composition_pairs($_) } @_;
my %union;
++ $union{$_} for (keys %$comp1, keys %$comp2);
my $dcomp;
{
no warnings 'uninitialized';
$dcomp = sum map {
($comp1->{$_} - $comp2->{$_}) ** 2
} keys %union;
}
return sqrt( $dcomp ) / 20; # where did 20 come from?
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With