How can I extend the code below to allow me to explore all instances where I have 2 mismatches or less between my substring and the parent string? Substring: <code>SSQP</code> String-to-match-to: <code>SSPQQQQPSSSSQQQSSQPSPSQSSQPSSQPPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQPSPSQ</code> Here is an example where only one possible mismatch is incorporated: <pre class="prettyprint"><code>>>> s = 'SSPQQQQPSSSSQQQSSQPSPSQSSQPSSQPPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQPSPSQ' >>> re.findall(r'(?=(SSQP|[A-Z]SQP|S[A-Z]QP|SS[A-Z]P|SSQ[A-Z]))', s) ['SSQQ', 'SSQP', 'SSQP', 'SSQP', 'SSQP', 'SSQP', 'SSQP', 'SSQP', 'SSQP'] </code></pre> Obviously, incorporating the possibility of two mismatches in the code above would require a lot of brute-force typing of all the possible combinations. How can I extend this code (or refactor this code) to explore the possibility of two mismatches? Furthermore, I want to modify my output so that I get the numeric index returned (not <code>SSQQ</code> or <code>SSQP</code>) of the exact position the substring matched the string.

The combinatorial explosion is not that bad for two mismatches out of four. First, observe that you can omit <code>SSQP</code> itself, since it's covered by all of the more lenient cases. <pre class="prettyprint"><code>re.findall(r'(?=([A-Z]SQP|S[A-Z]QP|SS[A-Z]P|SSQ[A-Z]))', s) </code></pre> So, the number of cases is <pre class="prettyprint"><code> 4! C(4, 1) = ––––––––––––– = 4 1! * (4 - 1)! </code></pre> For up to two mismatches, the number of cases is <pre class="prettyprint"><code> 4! C(4, 2) = ––––––––––––– = 6 2! * (4 - 2)! </code></pre> Namely, <pre class="prettyprint"><code>re.findall('(?=(SS..|S.Q.|S..P|.SQ.|.S.P|..QP))', s) </code></pre> (To simplify the illustration, I've taken the liberty of writing <code>.</code> instead of <code>[A-Z]</code>.) <hr> To get the positions of the matches instead of the text of the matches: <pre class="prettyprint"><code>[match.start(0) for match in re.finditer('(?=(SS..|S.Q.|S..P|.SQ.|.S.P|..QP))', s)] </code></pre>

String regex two mismatches Python

How can I extend the code below to allow me to explore all instances where I have 2 mismatches or less between my substring and the parent string?

Substring: SSQP

String-to-match-to: SSPQQQQPSSSSQQQSSQPSPSQSSQPSSQPPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQPSPSQ

Here is an example where only one possible mismatch is incorporated:

>>> s = 'SSPQQQQPSSSSQQQSSQPSPSQSSQPSSQPPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQPSPSQ'
>>> re.findall(r'(?=(SSQP|[A-Z]SQP|S[A-Z]QP|SS[A-Z]P|SSQ[A-Z]))', s)
['SSQQ', 'SSQP', 'SSQP', 'SSQP', 'SSQP', 'SSQP', 'SSQP', 'SSQP', 'SSQP']

Obviously, incorporating the possibility of two mismatches in the code above would require a lot of brute-force typing of all the possible combinations.

How can I extend this code (or refactor this code) to explore the possibility of two mismatches?

Furthermore, I want to modify my output so that I get the numeric index returned (not SSQQ or SSQP) of the exact position the substring matched the string.

How to match partial string in Python?

Use the in operator for partial matches, i.e., whether one string contains the other string. x in y returns True if x is contained in y ( x is a substring of y ), and False if it is not. If each character of x is contained in y discretely, False is returned.

How to check for string matches in Python?

String Equals Check in Python In python programming we can check whether strings are equal or not using the “==” or by using the “. __eq__” function. Example: s1 = 'String' s2 = 'String' s3 = 'string' # case sensitive equals check if s1 == s2: print('s1 and s2 are equal.

What is re I in Python?

A regular expression (or RE) specifies a set of strings that matches it; the functions in this module let you check if a particular string matches a given regular expression (or if a given regular expression matches a particular string, which comes down to the same thing).

You don't have to use re here you can use itertools module instead and save a lot of memory.

You can first extract all sub-strings with length 4 then compare them with your substring and just select those that have less that 2 difference with your substring :

from itertools import izip,islice,tee

def sub_findre(s,substring,diffnumber):
    sublen=len(substring)
    zip_gen=(izip(substring,islice(s,i,i+sublen)) for i in xrange(len(s)))
    for z in zip_gen:
        l,z=tee(z)
        if sum(1 for i,j in l if i==j)>=sublen-diffnumber:
            new=izip(*z)
            next(new)
            yield ''.join(next(new))

Demo:

s='SSPQQQQPSSSSQQQSSQPSPSQSSQPSSQPPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQPSPSQ'

substring='SSQP'
print list(sub_findre(s,substring,2))

['SSPQ', 'SPQQ', 'QQQP', 'SSSS', 'SSSQ', 'SSQQ', 'SQQQ', 'SSQP', 'PSQS', 'SSQP', 'SSQP', 'SQPP', 'SSSS', 'SSSQ', 'SSQP', 'PSQS', 'SSQP', 'SSSS', 'SSSQ', 'SSQP', 'PSQS', 'SSQP', 'SSSS', 'SSSQ', 'SSQP', 'PSQ']

If you want to return the indices you need to put the indices in izip which you can use itertools.repeat() to repeat the index with the length of substring :

from itertools import izip,islice,tee,repeat

def sub_findre(s,substring,diffnumber):
    sublen=len(substring)
    zip_gen=(izip(substring,islice(s,i,i+sublen),repeat(i,sublen)) for i in xrange(len(s)))
    for z in zip_gen:
        l,z=tee(z)
        if sum(1 for i,j,_ in l if i==j)>=sublen-diffnumber:
            new=izip(*z)
            next(new)
            next(new)
            yield next(new)[0]

Demo:

print list(sub_findre(s,substring,2))
[0, 1, 4, 8, 9, 10, 11, 15, 20, 23, 27, 28, 32, 33, 34, 39, 42, 46, 47, 48, 53, 56, 60, 61, 62, 67]

The combinatorial explosion is not that bad for two mismatches out of four.

First, observe that you can omit SSQP itself, since it's covered by all of the more lenient cases.

re.findall(r'(?=([A-Z]SQP|S[A-Z]QP|SS[A-Z]P|SSQ[A-Z]))', s)

So, the number of cases is

               4!
C(4, 1) = ––––––––––––– = 4
          1! * (4 - 1)!

For up to two mismatches, the number of cases is

               4!
C(4, 2) = ––––––––––––– = 6
          2! * (4 - 2)!

Namely,

re.findall('(?=(SS..|S.Q.|S..P|.SQ.|.S.P|..QP))', s)

(To simplify the illustration, I've taken the liberty of writing . instead of [A-Z].)

To get the positions of the matches instead of the text of the matches:

[match.start(0) for match in re.finditer('(?=(SS..|S.Q.|S..P|.SQ.|.S.P|..QP))', s)]

String regex two mismatches Python

Tags:

python

string-matching

regex

warship

People also ask

2 Answers

Mazdak

200_success

Recent Activity

Donate For Us

String regex two mismatches Python

Tags:

python

string-matching

regex

warship

People also ask

2 Answers

Mazdak

200_success

Related questions

Recent Activity

Donate For Us