
Python regex matching whatever matched in previous group (1 out of multiple options)

Tags:

python

regex

Let's say I have the regular expression (?:AA|BB)(.*)(?:AA|BB), which captures everything between the delimiters AA and BB.

The problem I encounter is that this will also match AA...BB.

How can I make it so that the regular expression only matches AA...AA and BB...BB?
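
A minimal reproduction of the problem, as a sketch (the sample strings are assumptions chosen for illustration):

```python
import re

# The pattern from the question: either delimiter may open, either may close.
pattern = re.compile(r'(?:AA|BB)(.*)(?:AA|BB)')

# Sample inputs are assumptions made for illustration.
same = pattern.search("AA text AA")   # wanted
mixed = pattern.search("AA text BB")  # unwanted, but it still matches

print(repr(same.group(1)))
print(repr(mixed.group(1)))
```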

AlanSTACK asked Dec 24 '22 08:12


2 Answers

If the strings you need to match start and end with the same leading and trailing delimiters, you just need to capture the leading delimiter and use a backreference inside the pattern itself:

(AA|BB)(.*)\1
^     ^    ^^

See the regex demo

In Python, use re.finditer if you want only the group you need. re.findall returns a list of tuples when the pattern has more than one capturing group, so every result would also contain AA or BB. To match from AA up to the nearest following AA, use the lazy quantifier *?: (AA|BB)(.*?)\1

A short Python demo:

import re
p = re.compile(r'(AA|BB)(.*)\1')
test_str = "AA text AA"
print([x.group(2).strip() for x in p.finditer(test_str)])
# => ['text']
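
To make the findall/finditer and greedy/lazy differences concrete, here is a small sketch. The sample string and the variable names are assumptions for illustration only:

```python
import re

# Sample input is an assumption made for illustration.
text = "AA one AA BB two BB AA three AA"

greedy = re.compile(r'(AA|BB)(.*)\1')
lazy = re.compile(r'(AA|BB)(.*?)\1')

# re.findall returns one tuple per match, so the delimiter comes along too.
pairs = lazy.findall(text)

# re.finditer yields match objects, so you can pick only the inner group.
lazy_inner = [m.group(2).strip() for m in lazy.finditer(text)]

# The greedy version runs to the last occurrence of the same delimiter.
greedy_inner = [m.group(2).strip() for m in greedy.finditer(text)]

print(pairs)         # [('AA', ' one '), ('BB', ' two '), ('AA', ' three ')]
print(lazy_inner)    # ['one', 'two', 'three']
print(greedy_inner)  # ['one AA BB two BB AA three']
```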

If you need to match strings with mismatching leading and trailing delimiters, you will have to use alternation:

AA(.*)ZZ|BB(.*)YY

Or, a lazy quantifier version that stops at the closest trailing ZZ or YY:

AA(.*?)ZZ|BB(.*?)YY

Note that this will output empty elements in the results, since only one of the two groups participates in each match. Use this pattern with caution if you plan to use it in re.sub: until Python 3.5, a group that did not participate in the match is not initialized with an empty string (it is None), so referencing it in the replacement might throw an exception.
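
As a sketch of both points (the sample string and the repl helper are assumptions for illustration), re.findall shows the empty elements, and a replacement function that checks for None keeps re.sub independent of how a given Python version treats unmatched groups:

```python
import re

p = re.compile(r'AA(.*?)ZZ|BB(.*?)YY')

# Sample input is an assumption made for illustration.
text = "AA Text 1 here ZZ and BB Text2 there YY"

# Only one group participates per match; findall fills the other with ''.
found = p.findall(text)

def repl(m):
    # Hypothetical helper: pick whichever group took part in the match.
    inner = m.group(1) if m.group(1) is not None else m.group(2)
    return "[" + inner.strip() + "]"

replaced = p.sub(repl, text)

print(found)     # [(' Text 1 here ', ''), ('', ' Text2 there ')]
print(replaced)  # [Text 1 here] and [Text2 there]
```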

Here is sample extraction code using re.finditer:

import re
p = re.compile(r'(AA)(.*?)(ZZ)|(BB)(.*?)(YY)')
test_str = "AA Text 1 here ZZ and BB Text2 there YY"
print("Contents:") 
print([x.group(2).strip() for x in p.finditer(test_str) if x.group(2)])
print([x.group(5).strip() for x in p.finditer(test_str) if x.group(5)])
print("Delimiters:")
print([(x.group(1), x.group(3)) for x in p.finditer(test_str) if x.group(1) and x.group(3)])
print([(x.group(4), x.group(6)) for x in p.finditer(test_str) if x.group(4) and x.group(6)])

Results:

Contents:
['Text 1 here']
['Text2 there']
Delimiters:
[('AA', 'ZZ')]
[('BB', 'YY')]

In real life, with very long and complex texts, these regexps can be unrolled to make matching linear and efficient, but this is a different story.

And last but not least, if you need to match the shortest substring from one delimiter to another that does not contain these delimiters inside, use a tempered greedy token:

AA((?:(?!AA|ZZ).)*)ZZ|BB((?:(?!BB|YY).)*)YY
   ^^^^^^^^^^^^^^^       ^^^^^^^^^^^^^^^ 

See the regex demo to see the difference from AA(.*?)ZZ|BB(.*?)YY.
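
If the demo link is not at hand, the difference can be sketched in Python. The sample string is an assumption, picked so that a second AA appears before the closing ZZ:

```python
import re

lazy = re.compile(r'AA(.*?)ZZ|BB(.*?)YY')
tempered = re.compile(r'AA((?:(?!AA|ZZ).)*)ZZ|BB((?:(?!BB|YY).)*)YY')

# Sample input is an assumption made for illustration.
text = "AA one AA two ZZ"

# The lazy version starts at the first AA and happily runs over the second one.
lazy_hit = lazy.search(text).group(1)

# The tempered token refuses to cross AA or ZZ, so the match starts at the last AA.
tempered_hit = tempered.search(text).group(1)

print(repr(lazy_hit))      # ' one AA two '
print(repr(tempered_hit))  # ' two '
```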

Wiktor Stribiżew answered May 24 '23 06:05


The question is confusing. From what I understood, you want it to match either AA..AA or BB..BB, but not AA..BB which it is currently matching. I'm awful with regex, but I think this should work:
Edit: Sorry, SE formatting messed it up.

(?:(AA(.*)AA)|(BB(.*)BB))


>>> import re
>>> data = ['AAsometextAA', 'BBothertextBB', 'NotMatched', 'AAalsonotmatchedBB']
>>> matches = list(filter(lambda x: x is not None, [re.match("(?:(AA(.*)AA)|(BB(.*)BB))", datum) for datum in data]))
>>> matches
[<_sre.SRE_Match object at 0x007DC078>, <_sre.SRE_Match object at 0x007DC288>]
>>> for match in matches:
...     print(match.group(0))
...
AAsometextAA
BBothertextBB
>>>
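
If you want only the text between the delimiters rather than the whole match, a sketch along these lines should work. With this pattern the inner text ends up in group 2 or group 4, depending on which branch matched; the variable names are assumptions for illustration:

```python
import re

pattern = re.compile(r"(?:(AA(.*)AA)|(BB(.*)BB))")
data = ['AAsometextAA', 'BBothertextBB', 'NotMatched', 'AAalsonotmatchedBB']

inner_texts = []
for datum in data:
    m = pattern.match(datum)
    if m is None:
        continue  # no AA...AA or BB...BB at the start of this string
    # Group 2 is set for the AA branch, group 4 for the BB branch.
    inner = m.group(2) if m.group(2) is not None else m.group(4)
    inner_texts.append(inner)

print(inner_texts)  # ['sometext', 'othertext']
```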
Goodies answered May 24 '23 08:05