How can I extend the code below to allow me to explore all instances where I have 2 mismatches or less between my substring and the parent string?
Substring: SSQP
String-to-match-to: SSPQQQQPSSSSQQQSSQPSPSQSSQPSSQPPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQPSPSQ
Here is an example where only one possible mismatch is incorporated:
>>> s = 'SSPQQQQPSSSSQQQSSQPSPSQSSQPSSQPPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQPSPSQ'
>>> re.findall(r'(?=(SSQP|[A-Z]SQP|S[A-Z]QP|SS[A-Z]P|SSQ[A-Z]))', s)
['SSQQ', 'SSQP', 'SSQP', 'SSQP', 'SSQP', 'SSQP', 'SSQP', 'SSQP', 'SSQP']
Obviously, incorporating the possibility of two mismatches in the code above would require a lot of brute-force typing of all the possible combinations.
How can I extend this code (or refactor this code) to explore the possibility of two mismatches?
Furthermore, I want to modify my output so that I get the numeric index returned (not SSQQ
or SSQP
) of the exact position the substring matched the string.
Use the in operator for partial matches, i.e., whether one string contains the other string. x in y returns True if x is contained in y ( x is a substring of y ), and False if it is not. If each character of x is contained in y discretely, False is returned.
String Equals Check in Python In python programming we can check whether strings are equal or not using the “==” or by using the “. __eq__” function. Example: s1 = 'String' s2 = 'String' s3 = 'string' # case sensitive equals check if s1 == s2: print('s1 and s2 are equal.
A regular expression (or RE) specifies a set of strings that matches it; the functions in this module let you check if a particular string matches a given regular expression (or if a given regular expression matches a particular string, which comes down to the same thing).
You don't have to use re
here you can use itertools
module instead and save a lot of memory.
You can first extract all sub-strings with length 4 then compare them with your substring
and just select those that have less that 2 difference with your substring
:
from itertools import izip,islice,tee
def sub_findre(s,substring,diffnumber):
sublen=len(substring)
zip_gen=(izip(substring,islice(s,i,i+sublen)) for i in xrange(len(s)))
for z in zip_gen:
l,z=tee(z)
if sum(1 for i,j in l if i==j)>=sublen-diffnumber:
new=izip(*z)
next(new)
yield ''.join(next(new))
Demo:
s='SSPQQQQPSSSSQQQSSQPSPSQSSQPSSQPPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQPSPSQ'
substring='SSQP'
print list(sub_findre(s,substring,2))
['SSPQ', 'SPQQ', 'QQQP', 'SSSS', 'SSSQ', 'SSQQ', 'SQQQ', 'SSQP', 'PSQS', 'SSQP', 'SSQP', 'SQPP', 'SSSS', 'SSSQ', 'SSQP', 'PSQS', 'SSQP', 'SSSS', 'SSSQ', 'SSQP', 'PSQS', 'SSQP', 'SSSS', 'SSSQ', 'SSQP', 'PSQ']
If you want to return the indices you need to put the indices in izip
which you can use itertools.repeat()
to repeat the index with the length of substring
:
from itertools import izip,islice,tee,repeat
def sub_findre(s,substring,diffnumber):
sublen=len(substring)
zip_gen=(izip(substring,islice(s,i,i+sublen),repeat(i,sublen)) for i in xrange(len(s)))
for z in zip_gen:
l,z=tee(z)
if sum(1 for i,j,_ in l if i==j)>=sublen-diffnumber:
new=izip(*z)
next(new)
next(new)
yield next(new)[0]
Demo:
print list(sub_findre(s,substring,2))
[0, 1, 4, 8, 9, 10, 11, 15, 20, 23, 27, 28, 32, 33, 34, 39, 42, 46, 47, 48, 53, 56, 60, 61, 62, 67]
The combinatorial explosion is not that bad for two mismatches out of four.
First, observe that you can omit SSQP
itself, since it's covered by all of the more lenient cases.
re.findall(r'(?=([A-Z]SQP|S[A-Z]QP|SS[A-Z]P|SSQ[A-Z]))', s)
So, the number of cases is
4!
C(4, 1) = ––––––––––––– = 4
1! * (4 - 1)!
For up to two mismatches, the number of cases is
4!
C(4, 2) = ––––––––––––– = 6
2! * (4 - 2)!
Namely,
re.findall('(?=(SS..|S.Q.|S..P|.SQ.|.S.P|..QP))', s)
(To simplify the illustration, I've taken the liberty of writing .
instead of [A-Z]
.)
To get the positions of the matches instead of the text of the matches:
[match.start(0) for match in re.finditer('(?=(SS..|S.Q.|S..P|.SQ.|.S.P|..QP))', s)]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With