Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fastest Way To Find Mismatch Positions Between Two Strings of the Same Length

Tags:

string

perl

I have a millions of pairs of string of same length which I want to compare and find the position where it has mismatches.

For example for each $str1 and $str2 we want to find mismatch position with $str_source:

$str_source = "ATTCCGGG";

$str1       = "ATTGCGGG"; # 1 mismatch with Str1 at position 3 (0-based)
$str2       = "ATACCGGC"; # 2 mismatches with source at position  2 and 7

Is there a fast way to do it. Currently I have the C style method which I loop every position in both strings using 'substr' function. But this approach is horribly slow.

my @mism_pos;
for $i (0 .. length($str_source)) {
   $source_base = substr($str_source,$i,1);
   $str_base    = substr($str2,$i,$1);

  if ($source_base ne $str_base) {
     push @mism_pos,$i;
  }

}
like image 302
neversaint Avatar asked Nov 04 '09 10:11

neversaint


2 Answers

Inline::C


The computation is easy, do it with Inline::C (read perldoc Inline::C-Cookbook and perldoc Inline::C for documentation):

use Inline C => << '...';                                                       
  void find_diffs(char* x, char* y) {                                           
    int i;                                                                      
    Inline_Stack_Vars;                                                          
    Inline_Stack_Reset;                                                         
    for(i=0; x[i] && y[i]; ++i) {                                               
      if(x[i] != y[i]) {                                                        
        Inline_Stack_Push(sv_2mortal(newSViv(i)));                              
      }                                                                         
    }                                                                           
    Inline_Stack_Done;                                                          
  }                                                                             
...                                                                             

@diffs= find_diffs("ATTCCGGG","ATTGCGGG");  print "@diffs\n";                   
@diffs= find_diffs("ATTCCGGG","ATACCGGC");  print "@diffs\n";                   

Here is the output of this script:

> script.pl 
3
2 7

PDL

If you want to process a lot of data fast in Perl, learn PDL (Documentation):

use PDL; 
use PDL::Char;                                                                  
$PDL::SHARE=$PDL::SHARE; # keep stray warning quiet 

my $source=PDL::Char->new("ATTCCGGG");                                          
for my $str ( "ATTGCGGG", "ATACCGGC") {                                         
  my $match =PDL::Char->new($str);                                              
  my @diff=which($match!=$source)->list;                                        
  print "@diff\n";                                                              
}

(Same output as first script.)

Notes: I used PDL very happily in genomic data processing. Together with memory mapped access to data stored on the disk, huge amounts of data can be processed quickly: all processing is done in highly optimized C loops. Also, you can easily access the same data through Inline::C for any features missing in PDL.

Note however, that the creation of one PDL vector is quite slow (constant time, it's acceptable for large data structures). So, you rather want to create one large PDL object with all your input data in one go than looping over individual data elements.

like image 127
Yaakov Belch Avatar answered Jan 01 '23 19:01

Yaakov Belch


Those look like gene sequences. If the strings are all 8-characters, and the domain of possible codes is ( A, C, G, T ) you might consider transforming the data somehow before processing it. That would give you only 65536 possible strings, so you can specialise your implementation.

For example, you write a method that takes an 8-character string and maps it to an integer. Memoize that so that the operation will be quick. Next, write a comparison function, that given two integers, tells you how they differ. You would call this in a suitable looping construct with a numeric equality test like unless ( $a != $b ) before calling the comparison - a short circuit for identical codes if you will.

like image 21
martin clayton Avatar answered Jan 01 '23 20:01

martin clayton