
How to get multiple matches with difflib.SequenceMatcher?

I am using difflib to find every occurrence of a short string in a longer sequence. However, when there are multiple matches, difflib seems to return only one:

>>> import difflib
>>> sm = difflib.SequenceMatcher(None, a='ACT', b='ACTGACT')
>>> sm.get_matching_blocks()
[Match(a=0, b=0, size=3), Match(a=3, b=7, size=0)]

The output I expected was:

[Match(a=0, b=0, size=3), Match(a=0, b=4, size=3), Match(a=3, b=7, size=0)]

In fact, the string ACTGACT contains two matches of ACT, at positions 0 and 4, both of size 3 (plus another match of size 0 at the end of the strings).

How can I get multiple matches? I was expecting difflib to return both positions.

dalloliogm asked Sep 28 '22 16:09



1 Answer

Why would you use difflib for that? You should be able to use standard regular expressions, since re.finditer yields every non-overlapping occurrence of a pattern:

import re
pattern = "ACT"
text = "ACTGACT"
matches = [m.span() for m in re.finditer(pattern, text)]

This will give you:

[(0, 3), (4, 7)]

Or is there some information you need that this does not include? It does not, of course, return the final empty match that difflib reports, but you could easily add that yourself.
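If you want the result in the same shape as the output you expected from difflib, something along these lines might work. This is only a sketch: `find_all_matches` is a hypothetical helper name, and `Match` here is a local namedtuple that imitates difflib's (a, b, size) triples rather than anything difflib itself returns. The pattern is passed through re.escape on the assumption that you are searching for a literal string.

```python
import re
from collections import namedtuple

# Local stand-in that mirrors the (a, b, size) triples difflib prints.
# It is defined here for illustration and is not difflib's own object.
Match = namedtuple("Match", "a b size")


def find_all_matches(pattern, text):
    """Return one Match per non-overlapping occurrence of pattern in text."""
    blocks = [
        # a is always 0 because each hit covers the whole pattern.
        Match(a=0, b=m.start(), size=len(pattern))
        for m in re.finditer(re.escape(pattern), text)
    ]
    # Append the zero-length terminator that difflib places at the end.
    blocks.append(Match(a=len(pattern), b=len(text), size=0))
    return blocks


print(find_all_matches("ACT", "ACTGACT"))
# [Match(a=0, b=0, size=3), Match(a=0, b=4, size=3), Match(a=3, b=7, size=0)]
```

Note that re.finditer only reports non-overlapping occurrences, so a pattern that can overlap itself would need a different search loop.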

k-nut answered Oct 03 '22 18:10
