
Fast Way to Find Difference between Two Strings of Equal Length in Perl

Tags: string, linux, unix, perl

Given pairs of strings like this:

    my $s1 = "ACTGGA";
    my $s2 = "AGTG-A";

    # Note: the strings can be longer than this.

I would like to find the position and character in $s1 where it differs from $s2. In this case the answer would be:

#String Position 0-based
# First col = Base in S1
# Second col = Base in S2
# Third col = Position in S1 where they differ
C G 1
G - 4

I can achieve that easily with substr(). But it is horribly slow. Typically I need to compare millions of such pairs.

Is there a fast way to achieve that?


asked Jan 17 '11 02:01

neversaint

2 Answers

Stringwise ^ is your friend:

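use strict;
use warnings;

my $s1 = "ACTGGA";
my $s2 = "AGTG-A";

# XOR the strings: matching positions become "\0", mismatches do not.
my $mask = $s1 ^ $s2;

# Each non-NUL character marks a difference; $-[0] is its offset.
while ($mask =~ /[^\0]/g) {
    print substr($s1, $-[0], 1), ' ', substr($s2, $-[0], 1), ' ', $-[0], "\n";
}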

EXPLANATION:

The ^ (exclusive or) operator, when used on strings, returns a string composed of the result of an exclusive or on each bit of the numeric value of each character. Breaking down an example into equivalent code:

"AB" ^ "ab"
( "A" ^ "a" ) . ( "B" ^ "b" )
chr( ord("A") ^ ord("a") ) . chr( ord("B") ^ ord("b") )
chr( 65 ^ 97 ) . chr( 66 ^ 98 )
chr(32) . chr(32)
" " . " "
"  "

The useful property here is that a NUL character ("\0") occurs when and only when the two strings have the same character at a given position. So ^ compares every character of the two strings in one quick operation, and the result can then be searched for non-NUL characters, each of which indicates a difference. The search is repeated by using the /g regex flag in scalar context, and the position of each difference is read from $-[0], which gives the offset of the beginning of the last successful match.
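As a minimal sketch of how the approach above could be packaged for reuse, the following wraps it in a sub. The name diff_positions and the equal-length check are assumptions added for illustration; they are not part of the original answer. The check matters because, for strings of unequal length, ^ behaves as if the shorter string were padded with zero bits, so the extra tail would be reported as differences.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original answer): returns a list of
# [base_in_s1, base_in_s2, zero_based_position], one entry per mismatch.
sub diff_positions {
    my ($s1, $s2) = @_;

    # Assumption: the inputs must have equal length. XOR would otherwise
    # treat the shorter string as zero-padded and flag the whole tail.
    die "strings differ in length\n" unless length($s1) == length($s2);

    my $mask = $s1 ^ $s2;    # "\0" wherever the two strings agree
    my @diffs;
    while ($mask =~ /[^\0]/g) {
        my $pos = $-[0];     # offset of the start of the current match
        push @diffs, [ substr($s1, $pos, 1), substr($s2, $pos, 1), $pos ];
    }
    return @diffs;
}

# Prints one "S1 S2 position" line per mismatch.
print join(' ', @$_), "\n" for diff_positions("ACTGGA", "AGTG-A");
```

For the sample pair this prints "C G 1" and "G - 4". Note that if the bitwise feature is enabled (for example through use v5.28 or later), ^ treats its operands as numbers and the string form is spelled ^. instead.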


answered Oct 07 '22 00:10

ysth

Use binary bit ops on the complete strings.

Things like $s1 & $s2 or $s1 ^ $s2 run incredibly fast, and work with strings of arbitrary length.
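A small sketch of what that can look like when only the number of differing positions is needed: XOR the two strings, then count the non-NUL characters of the result with tr. The variable names are illustrative, and the strings are assumed to have equal length.

```perl
use strict;
use warnings;

my $s1 = "ACTGGA";
my $s2 = "AGTG-A";

# "\0" at every position where the strings agree (equal lengths assumed).
my $mask = $s1 ^ $s2;

# tr with /c and an empty replacement list counts the characters that
# are NOT "\0" without modifying $mask.
my $mismatches = ($mask =~ tr/\0//c);

print "$mismatches\n";    # prints 2
```

This avoids the regex loop entirely, so it is worth trying first when the positions themselves are not needed.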

answered Oct 07 '22 02:10

tchrist
