Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient substring matching in perl

I am looking for an efficient solution to do find the longest possible substring in a string tolerating n mismatches in the main string

Eg: Main String

  1. AGACGTACTACTCTACTAGATGCA*TACTCTAC*
  2. AGACGTACTACTCTACTAGATGCA*TACTCTAC*
  3. AGACGTACTACTCTACAAGATGCA*TACTCTAC*
  4. AGACGTACTACTTTACAAGATGCA*TACTCTAC*

Search String:

  1. TACTCTACT : this should be considered a match to all of the above main strings.

Also I there could be the case where part of the substring is at the end of main string and I would like to pick that up also.

I would appreciate if you could give some pointers.

PS: I will have one search string and about 100 million main strings to search the substring for.

Thanks! -Abhi

like image 929
Abhi Avatar asked Jun 06 '11 22:06

Abhi


2 Answers

I'm not certain that this is what you're after but BioPerl has an approximate-grep tool called Bio::Grep::Backend::Agrep:

Bio::Grep::Backend::Agrep searches for a query with agrep

And agrep is "approximate grep". AFAIK, this builds a database and then uses that database to make your searches faster so this sounds like what you're after.

Looks like you're working with DNA sequences so you probably have BioPerl already installed.

You could also try using String::Approx directly:

Perl extension for approximate matching (fuzzy matching)

I suspect that Bio::Grep::Backend::Agrep will be faster and a better match for your task though.

like image 79
mu is too short Avatar answered Sep 23 '22 07:09

mu is too short


use strict;
use warnings;
use feature qw( say );

sub match {
   my ($s, $t, $max_x) = @_;

   my $m = my @s = unpack('(a)*', $s);
   my $n = my @t = unpack('(a)*', $t);

   my @length_at_k     = ( 0 ) x ($m+$n);
   my @mismatches_at_k = ( 0 ) x ($m+$n);
   my $offset = $m;

   my $best_length = 0;
   my @solutions;
   for my $i (0..$m-1) {
      --$offset;

      for my $j (0..$n-1) {
         my $k = $j + $offset;

         if ($s[$i] eq $t[$j]) {
            ++$length_at_k[$k];
         }
         elsif ($length_at_k[$k] > 0 && $mismatches_at_k[$k] < $max_x) {
            ++$length_at_k[$k];
            ++$mismatches_at_k[$k];
         }
         else {
            $length_at_k[$k] = 0;
            $mismatches_at_k[$k] = 0;
         }

         my $length = $length_at_k[$k] + $max_x - $mismatches_at_k[$k];
         $length = $i+1 if $length > $i+1;

         if ($length >= $best_length) {
            if ($length > $best_length) {
               $best_length = $length;
               @solutions = ();
            }

            push @solutions, $i-$length+1;
         }
      }
   }

   return map { substr($s, $_, $best_length) } @solutions;
}

say for match('AABBCC', 'DDBBEE', 2);

Output:

AABB
ABBC
BBCC
like image 28
ikegami Avatar answered Sep 20 '22 07:09

ikegami