Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python regex module not finding all matches even with overlapping = True

I am using the PyPy regex module with overlapping match support.

I have the following code in which I have a string A and I am looking for a DNA pattern defined in the regular expression using regex. I want to find all matches with my RE including the overlapping ones. regex is missing one of the matches and I have no idea how to fix it.

import regex as re
A = "GGGGAGAAGGGGGGCCTTCCTGGGTCCCCGAGAGTGCAGACATGCCTGGGTCCACAGCCACGGTTTGGG"
GQ_list = re.findall(r"[G]{3,6}[ACTG]{1,33}[G]{3,6}[ACTG]{1,33}[G]{3,6}[ACTG]{1,33}[G]{3,6}", A, overlapped=True)

GQ_list returns:

['GGGGAGAAGGGGGGCCTTCCTGGGTCCCCGAGAGTGCAGACATGCCTGGGTCCACAGCCACGGTTTGGG',
 'GGGAGAAGGGGGGCCTTCCTGGGTCCCCGAGAGTGCAGACATGCCTGGGTCCACAGCCACGGTTTGGG',
 'GGGGGGCCTTCCTGGGTCCCCGAGAGTGCAGACATGCCTGGGTCCACAGCCACGGTTTGGG',
 'GGGGGCCTTCCTGGGTCCCCGAGAGTGCAGACATGCCTGGGTCCACAGCCACGGTTTGGG',
 'GGGGCCTTCCTGGGTCCCCGAGAGTGCAGACATGCCTGGGTCCACAGCCACGGTTTGGG',
 'GGGCCTTCCTGGGTCCCCGAGAGTGCAGACATGCCTGGGTCCACAGCCACGGTTTGGG']

This is missing "GGGGAGAAGGGGGGCCTTCCTGGGTCCCCGAGAGTGCAGACATGCCTGGG" which is in my string A and matches the regular expression pattern. What is wrong here? What changes should I make to get all matches possible including the overlapping ones?

like image 377
Joe Avatar asked May 03 '26 11:05

Joe


1 Answers

tdlr: With the possibility of multiple regex matches at the same starting string index, re.findall or other regex methods will only find 1 match per starting index. You have to break up the search to find them all...


The issue you have is that a regex findall does not find all combinations from every index; it finds only one match from each index in turn -- usually the longest match. The techniques that find overlapping matches will still miss the multiple matches possible from a single string index. You need to modify your approach.

If you inspect your regex:

([G]{3,6}[ACTG]{1,33}[G]{3,6}[ACTG]{1,33}[G]{3,6}[ACTG]{1,33}[G]{3,6})

you will note that a matched sequence must start with a minimum of 3 G's and end with the same sequence. The sequence between the GGG[in_between_part]GGG is as short as 9 characters and as long as 84 characters (and that may contain the same starting / ending sequence of 'GGG''s).

We can use that information to find all possible string sequences that fit that description. Then we use your regex to filter that the identified sequence are indeed the ones we want.

First find the string index of every possible 'GGG' which is where a sub string would start or end (by definition):

s = "GGGGAGAAGGGGGGCCTTCCTGGGTCCCCGAGAGTGCAGACATGCCTGGGTCCACAGCCACGGTTTGGG"

offset=0
indicies=[]
while (s_idx:=s[offset:].find('GGG'))>-1:
    indicies.append(s_idx+offset)
    offset+=s_idx+1

>>> indicies
[0, 1, 8, 9, 10, 11, 21, 47, 66]
# these are the indicies of 'GGG' that might be that start or end
# of a sub string of interest.

Now we have the starting index of every 'GGG' in your string. We can now use a regex and the bisect module to filter for all possible matches in the string of your regex.

We are using bisect to find ending position of a candidate ending anchor, which is the same as the start anchor. The bisect module allows us to construct a slice that form sub strings that a) Start with 'GGG' (from the indicies list) and b) end with 'GGG' and c) have a length in-between the start and end anchors of between 9 and 84 characters. We then use re.fullmatch to assure the candidate substring fully matches your pattern:

import re 
import bisect 

matches=[]  
min_len=3+9
max_len=3+84
pat=re.compile(r'([G]{3,6}[ACTG]{1,33}[G]{3,6}[ACTG]{1,33}[G]{3,6}[ACTG]{1,33}[G]{3,6})')
for x in indicies:
    min_offest=bisect.bisect(indicies,x+min_len)
    max_offset=bisect.bisect(indicies,x+max_len)
    for idx in indicies[min_offest:]+indicies[max_offset:]:
        candidate=s[x:idx+3]
        if pat.fullmatch(candidate):
            matches.append(candidate)

Now we can print all the matches found, with their index in s and length:

>>> for ss in matches: print((s.index(ss), len(ss)),ss)
# This is only a primitive shortcut. If you want the actual
# index, save it when 'candidate' matches the regex

Prints all eight unique matches, including from the same starting index:

(0, 50) GGGGAGAAGGGGGGCCTTCCTGGGTCCCCGAGAGTGCAGACATGCCTGGG
(0, 69) GGGGAGAAGGGGGGCCTTCCTGGGTCCCCGAGAGTGCAGACATGCCTGGGTCCACAGCCACGGTTTGGG
(1, 49) GGGAGAAGGGGGGCCTTCCTGGGTCCCCGAGAGTGCAGACATGCCTGGG
(1, 68) GGGAGAAGGGGGGCCTTCCTGGGTCCCCGAGAGTGCAGACATGCCTGGGTCCACAGCCACGGTTTGGG
(8, 61) GGGGGGCCTTCCTGGGTCCCCGAGAGTGCAGACATGCCTGGGTCCACAGCCACGGTTTGGG
(9, 60) GGGGGCCTTCCTGGGTCCCCGAGAGTGCAGACATGCCTGGGTCCACAGCCACGGTTTGGG
(10, 59) GGGGCCTTCCTGGGTCCCCGAGAGTGCAGACATGCCTGGGTCCACAGCCACGGTTTGGG
(11, 58) GGGCCTTCCTGGGTCCCCGAGAGTGCAGACATGCCTGGGTCCACAGCCACGGTTTGGG

Note:

As stated in comments, the regex module does support variable width lookbehinds.

You could therefore be tempted to do:

m1=regex.findall(r'([G]{3,6}[ACTG]{1,33}[G]{3,6}[ACTG]{1,33}[G]{3,6}[ACTG]{1,33}[G]{3,6})', s, overlapped=True)     
# produces 6 unique matches 
m2=regex.findall(r'(?<=([G]{3,6}[ACTG]{1,33}[G]{3,6}[ACTG]{1,33}[G]{3,6}[ACTG]{1,33}[G]{3,6}))', s, overlapped=True)
# produces 2 matches, but one is a duplicate from m1

While this combo finds 1 extra string, the one you were looking for, it does not find all 8 unique matches. The string GGGAGAAGGGGGGCCTTCCTGGGTCCCCGAGAGTGCAGACATGCCTGGG at index 1 is missed.

like image 163
dawg Avatar answered May 05 '26 00:05

dawg