Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Get start and stop indexes of overlapping matches?

Tags:

python

regex

I need to know start and end indexes of matches from next regular expression:

pat = re.compile("(?=(ATG(?:(?!TAA|TGA|TAG)\w\w\w)*))")

Example string is s='GATGDTATGDTAAAA'

pat.findall(s) returns desired matches ['ATGDTATGD', 'ATGDTAAAA']. How to extract start and end indexes? I tried:

iters = pat.finditer(s)
for it in iters:
    print it.start()
    print it.end()

However, it.end() always coincide with it.start(), because the beginning of my pattern starts fro (?= so it does not consume any string (I need it to capture overlapping matches). Obviously pat.findall extracted desired string, but how to get start and stop indexes?

like image 481
ashim Avatar asked Nov 15 '13 04:11

ashim


Video Answer


2 Answers

As @Tomalak said, the regexp engine has no builtin notion of overlapping matches, so there's no "clever" solution to be found (which turned out to be wrong - see below). But it's straightforward to do it with a loop:

import re
pat = re.compile("ATG(?:(?!TAA|TGA|TAG)\w\w\w)*")
s = 'GATGDTATGDTAAAA'
i = 0
while True:
    m = pat.search(s, i)
    if m:
        start, end = m.span()
        print "match at {}:{} {!r}".format(start, end, m.group())
        i = start + 1
    else:
        break

which displays

match at 1:10 'ATGDTATGD'
match at 6:15 'ATGDTAAAA'

It works by starting the search over again one character beyond the start of the last match, until no more matches are found.

"Clever" or time bomb?

If you want to live dangerously, there's a 2-character change you can make to your original finditer code:

print it.start(1)
print it.end(1)

That is, get the start and end of the first (1) capturing group. By not passing an argument, you were getting the start and end of the match as a whole - but of course a matching assertion always matches an empty string (and so start and end are equal).

I say this is living dangerously because the semantics of a capturing group inside an assertion (whether lookahead or lookbehind, positive or negative, ...) are fuzzy at best. Hard to say whether you may have stumbled into a bug (or implementation accident) here! Cute :-)

EDIT: After a night's sleep and a brief discussion on Python-Dev, I believe this behavior is intentional (& so reliable too). To find all the (possibly overlapping!) matches for a regexp R, wrap it like so:

pat = re.compile("(?=(" + R + "))")

and then

for m in pat.finditer(some_string):
    m.group(1)  # the matched substring
    m.span(1)   # the slice indices of the match substring
    # etc

works fine.

Best to read (?=(R)) as "match an empty string here, but only if R starts here, and, if that succeeds, put the info about what R matched into group 1". Then finditer() proceeds as it always does when matching an empty string: it moves the start of the search to the next character, and tries again (same as what the by-hand loop in my first answer did).

Using this with findall() is trickier, because if R contains capturing groups too, you'll get all of them (can't pick & choose, as you can do with a match object such as finditer() returns).

like image 185
Tim Peters Avatar answered Sep 27 '22 01:09

Tim Peters


There are no overlapping matches in regular expressions.

Either you match something or you don't. Anything you match can only be part of one match/sub-match.

Look-aheads are ephemeral, they don't increase any real counters.

like image 38
Tomalak Avatar answered Sep 25 '22 01:09

Tomalak