How can I obtain the number of overlapping regex matches using Python?
I've read and tried the suggestions from this, that and a few other questions, but found none that would work for my scenario. Here it is:
akka
a.*k
A proper function should yield 2 as the number of matches, since there are two possible end positions (k
letters).
The pattern might also be more complicated, for example a.*k.*a
should also be matched twice in akka
(since there are two k
's in the middle).
I think that what you're looking for is probably better done with a parsing library like lepl:
>>> from lepl import *
>>> parser = Literal('a') + Any()[:] + Literal('k')
>>> parser.config.no_full_first_match()
>>> list(parser.parse_all('akka'))
[['akk'], ['ak']]
>>> parser = Literal('a') + Any()[:] + Literal('k') + Any()[:] + Literal('a')
>>> list(parser.parse_all('akka'))
[['akka'], ['akka']]
I believe that the length of the output from parser.parse_all
is what you're looking for.
Note that you need to use parser.config.no_full_first_match()
to avoid errors if your pattern doesn't match the whole string.
Edit: Based on the comment from @Shamanu4, I see you want matching results starting from any position, you can do that as follows:
>>> text = 'bboo'
>>> parser = Literal('b') + Any()[:] + Literal('o')
>>> parser.config.no_full_first_match()
>>> substrings = [text[i:] for i in range(len(text))]
>>> matches = [list(parser.parse_all(substring)) for substring in substrings]
>>> matches = filter(None, matches) # Remove empty matches
>>> matches = list(itertools.chain.from_iterable(matches)) # Flatten results
>>> matches = list(itertools.chain.from_iterable(matches)) # Flatten results (again)
>>> matches
['bboo', 'bbo', 'boo', 'bo']
Yes, it is ugly and unoptimized but it seems to be working. This is a simple try of all possible but unique variants
def myregex(pattern,text,dir=0):
import re
m = re.search(pattern, text)
if m:
yield m.group(0)
if len(m.group('suffix')):
for r in myregex(pattern, "%s%s%s" % (m.group('prefix'),m.group('suffix')[1:],m.group('end')),1):
yield r
if dir<1 :
for r in myregex(pattern, "%s%s%s" % (m.group('prefix'),m.group('suffix')[:-1],m.group('end')),-1):
yield r
def myprocess(pattern, text):
parts = pattern.split("*")
for i in range(0, len(parts)-1 ):
res=""
for j in range(0, len(parts) ):
if j==0:
res+="(?P<prefix>"
if j==i:
res+=")(?P<suffix>"
res+=parts[j]
if j==i+1:
res+=")(?P<end>"
if j<len(parts)-1:
if j==i:
res+=".*"
else:
res+=".*?"
else:
res+=")"
for r in myregex(res,text):
yield r
def mycount(pattern, text):
return set(myprocess(pattern, text))
test:
>>> mycount('a*b*c','abc')
set(['abc'])
>>> mycount('a*k','akka')
set(['akk', 'ak'])
>>> mycount('b*o','bboo')
set(['bbo', 'bboo', 'bo', 'boo'])
>>> mycount('b*o','bb123oo')
set(['b123o', 'bb123oo', 'bb123o', 'b123oo'])
>>> mycount('b*o','ffbfbfffofoff')
set(['bfbfffofo', 'bfbfffo', 'bfffofo', 'bfffo'])
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With