Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fast Way to Find Difference between Two Strings of Equal Length in Perl

Given pairs of string like this.

    my $s1 = "ACTGGA";
    my $s2 = "AGTG-A";

   # Note the string can be longer than this.

I would like to find position and character in in $s1 where it differs with $s2. In this case the answer would be:

#String Position 0-based
# First col = Base in S1
# Second col = Base in S2
# Third col = Position in S1 where they differ
C G 1
G - 4

I can achieve that easily with substr(). But it is horribly slow. Typically I need to compare millions of such pairs.

Is there a fast way to achieve that?

like image 867
neversaint Avatar asked Jan 17 '11 02:01

neversaint


People also ask

How do I check if two strings are equal in Perl?

'eq' operator in Perl is one of the string comparison operators used to check for the equality of the two strings. It is used to check if the string to its left is stringwise equal to the string to its right.

What is == in Perl?

"== does a numeric comparison: it converts both arguments to a number and then compares them."

How do you find the difference between two strings?

difference() returns the difference between two strings, returning the portion of the second string, which starts to differ from the first. StringUtils. indexOfDifference() returns the index at which the second string starts to diverge from the first.

How do you find the length of a string in Perl?

length() function in Perl finds length (number of characters) of a given string, or $_ if not specified. Return: Returns the size of the string.


2 Answers

Stringwise ^ is your friend:

use strict;
use warnings;
my $s1 = "ACTGGA";
my $s2 = "AGTG-A";

my $mask = $s1 ^ $s2;
while ($mask =~ /[^\0]/g) {
    print substr($s1,$-[0],1), ' ', substr($s2,$-[0],1), ' ', $-[0], "\n";
}

EXPLANATION:

The ^ (exclusive or) operator, when used on strings, returns a string composed of the result of an exclusive or on each bit of the numeric value of each character. Breaking down an example into equivalent code:

"AB" ^ "ab"
( "A" ^ "a" ) . ( "B" ^ "b" )
chr( ord("A") ^ ord("a") ) . chr( ord("B") ^ ord("b") )
chr( 65 ^ 97 ) . chr( 66 ^ 98 )
chr(32) . chr(32)
" " . " "
"  "

The useful feature of this here is that a nul character ("\0") occurs when and only when the two strings have the same character at a given position. So ^ can be used to efficiently compare every character of the two strings in one quick operation, and the result can be searched for non-nul characters (indicating a difference). The search can be repeated using the /g regex flag in scalar context, and the position of each character difference found using $-[0], which gives the offset of the beginning of the last successful match.

like image 135
ysth Avatar answered Oct 07 '22 00:10

ysth


Use binary bit ops on the complete strings.

Things like $s1 & $s2 or $s1 ^ $s2 run incredibly fast, and work with strings of arbitrary length.

like image 37
tchrist Avatar answered Oct 07 '22 02:10

tchrist