I need to know start and end indexes of matches from next regular expression:
pat = re.compile("(?=(ATG(?:(?!TAA|TGA|TAG)\w\w\w)*))")
Example string is s='GATGDTATGDTAAAA'
pat.findall(s)
returns desired matches ['ATGDTATGD', 'ATGDTAAAA']
. How to extract start and end indexes?
I tried:
iters = pat.finditer(s)
for it in iters:
print it.start()
print it.end()
However, it.end()
always coincide with it.start()
, because the beginning of my pattern starts fro (?=
so it does not consume any string (I need it to capture overlapping matches). Obviously pat.findall
extracted desired string, but how to get start and stop indexes?
As @Tomalak said, the regexp engine has no builtin notion of overlapping matches, so there's no "clever" solution to be found (which turned out to be wrong - see below). But it's straightforward to do it with a loop:
import re
pat = re.compile("ATG(?:(?!TAA|TGA|TAG)\w\w\w)*")
s = 'GATGDTATGDTAAAA'
i = 0
while True:
m = pat.search(s, i)
if m:
start, end = m.span()
print "match at {}:{} {!r}".format(start, end, m.group())
i = start + 1
else:
break
which displays
match at 1:10 'ATGDTATGD'
match at 6:15 'ATGDTAAAA'
It works by starting the search over again one character beyond the start of the last match, until no more matches are found.
If you want to live dangerously, there's a 2-character change you can make to your original finditer
code:
print it.start(1)
print it.end(1)
That is, get the start and end of the first (1
) capturing group. By not passing an argument, you were getting the start and end of the match as a whole - but of course a matching assertion always matches an empty string (and so start and end are equal).
I say this is living dangerously because the semantics of a capturing group inside an assertion (whether lookahead or lookbehind, positive or negative, ...) are fuzzy at best. Hard to say whether you may have stumbled into a bug (or implementation accident) here! Cute :-)
EDIT: After a night's sleep and a brief discussion on Python-Dev, I believe this behavior is intentional (& so reliable too). To find all the (possibly overlapping!) matches for a regexp R, wrap it like so:
pat = re.compile("(?=(" + R + "))")
and then
for m in pat.finditer(some_string):
m.group(1) # the matched substring
m.span(1) # the slice indices of the match substring
# etc
works fine.
Best to read (?=(R))
as "match an empty string here, but only if R
starts here, and, if that succeeds, put the info about what R
matched into group 1". Then finditer()
proceeds as it always does when matching an empty string: it moves the start of the search to the next character, and tries again (same as what the by-hand loop in my first answer did).
Using this with findall()
is trickier, because if R
contains capturing groups too, you'll get all of them (can't pick & choose, as you can do with a match object such as finditer()
returns).
There are no overlapping matches in regular expressions.
Either you match something or you don't. Anything you match can only be part of one match/sub-match.
Look-aheads are ephemeral, they don't increase any real counters.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With