I have a huge number of records containing sequences ('ATCGTGTGCATCAGTTTCGA...'), up to 500 characters. I also have a list of smaller sequences, usually 10-20 characters. I would like to use the Levenshtein distance to find these smaller sequences inside the records, allowing for small changes or indels (L_distance <= 2).
The problem is that I also want to get the start position of each smaller sequence, and Levenshtein.distance() computes the distance between two whole strings; it doesn't locate one sequence inside another.
>>> import Levenshtein
>>> s1 = input('first word: ')
first word: ATCGTAATACGATCGTACGACATCGCGGCCCTAGC
>>> s2 = input('second word: ')
second word: TACGAT
>>> Levenshtein.distance(s1, s2)
29
In this example I would like to obtain the position (7) and the distance (0 in this case).
Is there an easy way to solve this problem, or do I have to break the larger sequences into smaller ones and then run the Levenshtein distance for all of them? That might take too much time.
Thanks.
UPDATE: a naive implementation that scans every same-length substring window, after first looking for an exact match.
def find_tag(pattern, text, errors):
    """Slide a window of len(pattern) over text and return [position, distance]
    of the best match with Levenshtein distance <= errors, or None."""
    m = len(pattern)
    i = 0
    match = None  # avoids an UnboundLocalError when nothing matches
    min_distance = errors + 1
    while i <= len(text) - m:
        distance = Levenshtein.distance(text[i:i+m], pattern)
        print(text[i:i+m], distance)  # to see all matches
        if distance <= errors:
            if distance < min_distance:
                match = [i, distance]
                min_distance = distance
        i += 1
    return match
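For example, with the sequences from the interactive session above, the exact occurrence at position 7 wins (the per-window printouts are omitted here):

>>> find_tag('TACGAT', 'ATCGTAATACGATCGTACGACATCGCGGCCCTAGC', 2)
[7, 0]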
# Real example. Here we look for just one pattern, but we have about 50.
import re
import Levenshtein

text = 'GACTAGCACTGTAGGGATAACAATTTCACACAGGTGGACAATTACATTGAAAATCACAGATTGGTCACACACACATTGGACATACATAGAAACACACACACATACATTAGATACGAACATAGAAACACACATTAGACGCGTACATAGACACAAACACATTGACAGGCAGTTCAGATGATGACGCCCGACTGATACTCGCGTAGTCGTGGGAGGCAAGGCACACAGGGGATAGG'  # Example of a record
pattern = 'TGCACTGTAGGGATAACAAT'  # distance 1
errors = 2  # max errors allowed

match = re.search(pattern, text)
if match:
    print([match.start(), 0])  # First we look for an exact match
else:
    print(find_tag(pattern, text, errors))
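With about 50 patterns, the same exact-match-then-fuzzy logic can be wrapped in a loop. A sketch, where tag_list is a hypothetical name for your list of patterns:

# Sketch: scan one record against all patterns.
tag_list = [pattern]  # hypothetical list -- put all ~50 patterns here
for tag in tag_list:
    match = re.search(tag, text)
    if match:
        print(tag, [match.start(), 0])  # exact match
    else:
        print(tag, find_tag(tag, text, errors))  # fuzzy match, or None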
Assuming the maximum Levenshtein distance allowed is small, this can be done in a single pass over the sequence while keeping a list of candidate fuzzy matches.
Here is an example implementation I just worked up. It's not thoroughly tested, documented or optimized. But at least it works for simple examples (see below). I've tried to avoid having it return several matches due to skipping characters at the edges of the sub-sequence, but as I've said, I haven't thoroughly tested this.
If you're interested I'll be happy to clean this up, write some tests, do basic optimization and make this available as an open-source library.
from collections import namedtuple

Candidate = namedtuple('Candidate', ['start', 'subseq_index', 'dist'])
Match = namedtuple('Match', ['start', 'end', 'dist'])

def find_near_matches(subsequence, sequence, max_l_dist=0):
    """Yield Match-es of subsequence within sequence with Levenshtein
    distance <= max_l_dist, in a single pass over sequence."""
    prev_char = None
    candidates = []
    for index, char in enumerate(sequence):
        # start new candidates at this index, optionally skipping
        # some of the subsequence's first characters
        for skip in range(min(max_l_dist + 1, len(subsequence))):
            candidates.append(Candidate(index, skip, skip))
            if subsequence[skip] == prev_char:
                break
        new_candidates = []
        for cand in candidates:
            if char == subsequence[cand.subseq_index]:
                if cand.subseq_index + 1 == len(subsequence):
                    yield Match(cand.start, index + 1, cand.dist)
                else:
                    new_candidates.append(cand._replace(
                        subseq_index=cand.subseq_index + 1,
                    ))
            else:
                if cand.dist == max_l_dist or cand.subseq_index == 0:
                    continue
                # add a candidate skipping a sequence char
                new_candidates.append(cand._replace(dist=cand.dist + 1))
                # try skipping subsequence chars
                for n_skipped in range(1, max_l_dist - cand.dist + 1):
                    if cand.subseq_index + n_skipped == len(subsequence):
                        yield Match(cand.start, index + 1, cand.dist + n_skipped)
                        break
                    elif subsequence[cand.subseq_index + n_skipped] == char:
                        # add a candidate skipping n_skipped subsequence chars
                        new_candidates.append(cand._replace(
                            dist=cand.dist + n_skipped,
                            subseq_index=cand.subseq_index + n_skipped,
                        ))
                        break
        candidates = new_candidates
        prev_char = char
Now:
>>> list(find_near_matches('bde', 'abcdefg', 0))
[]
>>> list(find_near_matches('bde', 'abcdefg', 1))
[Match(start=1, end=5, dist=1), Match(start=3, end=5, dist=1)]
>>> list(find_near_matches('cde', 'abcdefg', 0))
[Match(start=2, end=5, dist=0)]
>>> list(find_near_matches('cde', 'abcdefg', 1))
[Match(start=2, end=5, dist=0)]
>>> match = _[0]
>>> 'abcdefg'[match.start:match.end]
'cde'
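Since the generator may yield several matches for the same region (as in the 'bde' example above), you may want only the best one. A minimal sketch, picking the lowest distance and earliest start using the Match fields:

>>> matches = list(find_near_matches('bde', 'abcdefg', 1))
>>> min(matches, key=lambda m: (m.dist, m.start))
Match(start=1, end=5, dist=1)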
EDIT:
Following this question, I'm writing a Python library for searching for nearly matching sub-sequences: fuzzysearch. It is still very much a work in progress.
For now, try the find_near_matches_with_ngrams() function! It should perform particularly well in your use-case.
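A sketch of what a call might look like; this assumes find_near_matches_with_ngrams() takes the same (subsequence, sequence, max_l_dist) arguments as find_near_matches() above, so check the fuzzysearch README for the actual signature:

from fuzzysearch import find_near_matches_with_ngrams

# Assumed signature, mirroring find_near_matches() above --
# verify against the library's documentation.
matches = find_near_matches_with_ngrams('TGCACTGTAGGGATAACAAT', text, 2)
print(matches)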