
Get position of subsequence using Levenshtein-Distance

I have a huge number of records containing sequences ('ATCGTGTGCATCAGTTTCGA...'), up to 500 characters. I also have a list of smaller sequences, usually 10-20 characters. I would like to use the Levenshtein distance in order to find these smaller sequences inside the records allowing for small changes or indels (L_distance <=2).

The problem is that I also want to get the start position of such smaller sequences, and apparently Levenshtein.distance compares the two strings in full, so it doesn't tell me where the best-matching substring starts.

>>> import Levenshtein
>>> s1 = raw_input('first word: ')
first word: ATCGTAATACGATCGTACGACATCGCGGCCCTAGC
>>> s2 = raw_input('second word: ')
second word: TACGAT
>>> Levenshtein.distance(s1,s2)
29

In this example I would like to obtain the position (7) and the distance (0 in this case).

Is there an easy way to solve this problem, or do I have to break the larger sequences into smaller ones and then run the Levenshtein distance for all of them? That might take too much time.

Thanks.
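
As an aside, the third-party regex module supports fuzzy matching and reports match positions directly, so it may be worth a look here. A minimal sketch, assuming the package is installed (pip install regex); the {e<=2} syntax and BESTMATCH flag are taken from its documentation:

import regex

s1 = 'ATCGTAATACGATCGTACGACATCGCGGCCCTAGC'
# {e<=2} allows up to 2 edits of any kind; BESTMATCH prefers the match with the fewest errors
m = regex.search(r'(?:TACGAT){e<=2}', s1, flags=regex.BESTMATCH)
if m:
    print m.span(), m.fuzzy_counts  # start/end of the hit and (substitutions, insertions, deletions)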

UPDATE: a naive implementation that checks every window of the pattern's length after first looking for an exact match.

def find_tag(pattern, text, errors):
    """Slide a window of len(pattern) over text; return [position, distance] of the best match."""
    m = len(pattern)
    i = 0
    min_distance = errors + 1
    match = None  # stays None if no window is within the allowed distance
    while i <= len(text) - m:
        distance = Levenshtein.distance(text[i:i+m], pattern)
        print text[i:i+m], distance  # to see all windows
        if distance <= errors and distance < min_distance:
            match = [i, distance]
            min_distance = distance
        i += 1
    return match

#Real example. In this case just looking for one pattern, but we have about 50.
import re, Levenshtein

text = 'GACTAGCACTGTAGGGATAACAATTTCACACAGGTGGACAATTACATTGAAAATCACAGATTGGTCACACACACATTGGACATACATAGAAACACACACACATACATTAGATACGAACATAGAAACACACATTAGACGCGTACATAGACACAAACACATTGACAGGCAGTTCAGATGATGACGCCCGACTGATACTCGCGTAGTCGTGGGAGGCAAGGCACACAGGGGATAGG' #Example of a record
pattern = 'TGCACTGTAGGGATAACAAT' #distance 1
errors = 2 #max errors allowed

match = re.search(pattern, text)

if match:
    print [match.start(), 0]  # first look for an exact match
else:
    print find_tag(pattern, text, errors)  # fall back to the fuzzy scan
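
On this record the exact search fails, so the fuzzy scan takes over; if I've traced it correctly, the best window starts at index 4 ('AGCACTGTAGGGATAACAAT'), which differs from the pattern only in its first character, so find_tag should print [4, 1].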

asked Nov 01 '13 by biojl


1 Answer

Assuming the maximum Levenshtein distance allowed is small, this can be done in a single pass over the sequence while keeping a list of candidates for fuzzy matches.

Here is an example implementation I just worked up. It's not thoroughly tested, documented or optimized. But at least it works for simple examples (see below). I've tried to avoid having it return several matches due to skipping characters at the edges of the sub-sequence, but as I've said, I haven't thoroughly tested this.

If you're interested I'll be happy to clean this up, write some tests, do basic optimization and make this available as an open-source library.

from collections import namedtuple

Candidate = namedtuple('Candidate', ['start', 'subseq_index', 'dist'])
Match = namedtuple('Match', ['start', 'end', 'dist'])

def find_near_matches(subsequence, sequence, max_l_dist=0):
    prev_char = None
    candidates = []
    for index, char in enumerate(sequence):
        for skip in range(min(max_l_dist+1, len(subsequence))):
            candidates.append(Candidate(index, skip, skip))
            if subsequence[skip] == prev_char:
                break
        new_candidates = []
        for cand in candidates:
            if char == subsequence[cand.subseq_index]:
                if cand.subseq_index + 1 == len(subsequence):
                    yield Match(cand.start, index + 1, cand.dist)
                else:
                    new_candidates.append(cand._replace(
                        subseq_index=cand.subseq_index + 1,
                    ))
            else:
                if cand.dist == max_l_dist or cand.subseq_index == 0:
                    continue
                # add a candidate skipping a sequence char
                new_candidates.append(cand._replace(dist=cand.dist + 1))
                # try skipping subsequence chars
                for n_skipped in range(1, max_l_dist - cand.dist + 1):
                    if cand.subseq_index + n_skipped == len(subsequence):
                        yield Match(cand.start, index + 1, cand.dist + n_skipped)
                        break
                    elif subsequence[cand.subseq_index + n_skipped] == char:
                        # add a candidate skipping n_skipped subsequence chars
                        new_candidates.append(cand._replace(
                            dist=cand.dist + n_skipped,
                            subseq_index=cand.subseq_index + n_skipped,
                        ))
                        break
        candidates = new_candidates
        prev_char = char

Now:

>>> list(find_near_matches('bde', 'abcdefg', 0))
[]
>>> list(find_near_matches('bde', 'abcdefg', 1))
[Match(start=1, end=5, dist=1), Match(start=3, end=5, dist=1)]
>>> list(find_near_matches('cde', 'abcdefg', 0))
[Match(start=2, end=5, dist=0)]
>>> list(find_near_matches('cde', 'abcdefg', 1))
[Match(start=2, end=5, dist=0)]
>>> match = _[0]
>>> 'abcdefg'[match.start:match.end]
'cde'

EDIT:

Following this question, I'm writing a Python library for searching for nearly matching sub-sequences: fuzzysearch. It is still very much a work in progress.

For now, try the find_near_matches_with_ngrams() function! It should perform particularly well in your use case.
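
A minimal usage sketch, assuming the package is installed (pip install fuzzysearch) and that its top-level find_near_matches() takes a max_l_dist keyword as the library's documentation describes; the exact API may differ from the in-answer function of the same name:

from fuzzysearch import find_near_matches

sequence = 'ATCGTAATACGATCGTACGACATCGCGGCCCTAGC'
# each returned match exposes .start, .end and .dist, like the Match namedtuple above
for m in find_near_matches('TACGAT', sequence, max_l_dist=2):
    print m.start, m.end, m.dist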

answered Oct 01 '22 by taleinat