
Python 3 regex - find all overlapping matches' start and end index in a string

Tags: python, regex

This was my original approach:

import re

string = '1' * 15
result = re.finditer(r'(?=11111)', string)      # overlapped = True
                                                # Doesn't work for me
for i in result:                                # python 3.5
    print(i.start(), i.end())

It finds every overlapping match position, but the end index is wrong: each match ends where it starts. The output:

0 0
1 1
2 2
3 3
(and so on...)

My question: how can I find all overlapping matches and also get the correct start and end index of each one?

Bjango asked Jan 04 '23 06:01


1 Answer

The problem is that a lookahead is a zero-width assertion: it consumes no text, i.e. it adds nothing to the match result. It only marks a position in the string, so every match starts and ends at the same index.
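
To see this concretely, here is a minimal sketch that inspects the matches of the bare lookahead from the question. The variable names (matches, spans, texts) are only illustrative:

```python
import re

s = '1' * 15

# Every match of a bare lookahead is an empty string at some position.
matches = list(re.finditer(r'(?=11111)', s))
spans = [m.span() for m in matches]     # (start, end) of the whole match
texts = [m.group(0) for m in matches]   # text consumed by the whole match

print(spans[:3])     # [(0, 0), (1, 1), (2, 2)]
print(texts[:3])     # ['', '', '']
print(len(matches))  # 11 positions where '11111' can start
```

The positions are found correctly, but start and end are always equal because the overall match is empty.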

Wrap the pattern inside the lookahead in a capturing group, i.e. (?=(11111)), and read the start and end of group 1 with i.start(1) and i.end(1):

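import re

s = '1' * 15
# The capturing group inside the lookahead records the text seen at each position.
result = re.finditer(r'(?=(11111))', s)

for i in result:
    # Print the span of group 1 as a (start, end) tuple.
    print((i.start(1), i.end(1)))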

The output is:

(0, 5)
(1, 6)
(2, 7)
(3, 8)
(4, 9)
(5, 10)
(6, 11)
(7, 12)
(8, 13)
(9, 14)
(10, 15)
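
If you need this for more than one pattern, the same idea can be wrapped in a small helper. This is only a sketch: overlapping_spans is a hypothetical name, not part of the re module, and it assumes the pattern has no numbered backreferences (the extra outer group shifts group numbers) and no inline flags such as (?i). It reports at most one match per start position, namely the one the engine finds first there.

```python
import re

def overlapping_spans(pattern, text):
    """Yield (start, end) for each overlapping match of pattern in text.

    Hypothetical helper: wraps pattern in a capturing group inside a
    lookahead, so the scan advances one position at a time while
    group 1 still records the matched text.
    """
    wrapped = re.compile('(?=(' + pattern + '))')
    for m in wrapped.finditer(text):
        yield m.span(1)

spans = list(overlapping_spans('11111', '1' * 15))
print(spans[0], spans[-1])                       # (0, 5) (10, 15)
print(list(overlapping_spans('aba', 'ababa')))   # [(0, 3), (2, 5)]
```

m.span(1) is simply (m.start(1), m.end(1)) in one call.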
Wiktor Stribiżew answered Jan 14 '23 03:01