
Extract named group regex pattern from a compiled regex in Python

I have a regex in Python that contains several named groups. However, patterns that match one group can be missed if previous groups have matched because overlaps don't seem to be allowed. As an example:

import re
myText = 'sgasgAAAaoasgosaegnsBBBausgisego'
myRegex = re.compile('(?P<short>(?:AAA))|(?P<long>(?:AAA.*BBB))')

x = re.findall(myRegex,myText)
print(x)

Produces the output:

[('AAA', '')]

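The 'long' group never finds a match: at the position where 'AAA' starts, the alternatives are tried left to right and the preceding 'short' group succeeds first, and the search then resumes after the 'AAA' that has already been consumed.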
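A minimal sketch that makes this visible, assuming the same text and pattern as above. It uses finditer and the match object's lastgroup attribute to report which named group produced each match and which span it consumed:

```python
import re

myText = 'sgasgAAAaoasgosaegnsBBBausgisego'
myRegex = re.compile('(?P<short>(?:AAA))|(?P<long>(?:AAA.*BBB))')

# Collect, for every match, the group that matched and the span it consumed.
matches = [(m.lastgroup, m.span()) for m in myRegex.finditer(myText)]
print(matches)  # [('short', (5, 8))]
```

Only one match is reported and it belongs to 'short', so 'long' never gets a chance to start at the same position.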

I've tried to find a method to allow overlapping but failed. As an alternative, I've been looking for a way to run each named group separately. Something like the following:

for g in myRegex.groupindex.keys():
    match = re.findall(***regex_for_named_group_g***,myText)

Is it possible to extract the regex for each named group?

Ultimately, I'd like to produce a dictionary output (or similar) like:

{'short':'AAA',
 'long':'AAAaoasgosaegnsBBB'}

Any and all suggestions would be gratefully received.

user1718097 asked Feb 19 '18 01:02


2 Answers

There really doesn't appear to be a nicer way to do this, but here's another approach, along the lines of this other answer but somewhat simpler. It will work provided that a) your patterns are always formed as a series of named groups separated by pipes, and b) the named group patterns never contain named groups themselves.

The following would be my approach if you're interested in all matches of each pattern. The argument to re.split looks for a literal pipe followed by (?P<, the beginning of a named group (the lookahead keeps the (?P< in the next piece). It compiles each subpattern and uses the groupindex attribute to extract the name.

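import re

def nameToMatches(pattern, string):
    # Map each named group to the list of all its matches in string.
    result = dict()
    # Split on every pipe that directly precedes a named group.
    for subpattern in re.split(r'\|(?=\(\?P<)', pattern):
        rx = re.compile(subpattern)
        name = list(rx.groupindex)[0]  # the only group name in this piece
        result[name] = rx.findall(string)
    return result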

With your given text and pattern, this returns {'long': ['AAAaoasgosaegnsBBB'], 'short': ['AAA']}. Patterns that don't match at all will have an empty list for their value.
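To see what the split step produces on its own, here is a small sketch that uses the pattern from the question. The variable names subpatterns and names are just for illustration:

```python
import re

pattern = '(?P<short>(?:AAA))|(?P<long>(?:AAA.*BBB))'

# Split on each pipe that is immediately followed by "(?P<".
subpatterns = re.split(r'\|(?=\(\?P<)', pattern)
print(subpatterns)  # ['(?P<short>(?:AAA))', '(?P<long>(?:AAA.*BBB))']

# Each piece compiles on its own and exposes exactly one group name.
names = [list(re.compile(sub).groupindex)[0] for sub in subpatterns]
print(names)  # ['short', 'long']
```

Because the lookahead does not consume anything, each piece keeps its leading (?P< and is a valid pattern by itself.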

If you only want one match per pattern, you can make it a bit simpler still:

def nameToMatch(pattern, string):
    result = dict()
    for subpattern in re.split(r'\|(?=\(\?P<)', pattern):
        match = re.search(subpattern, string)
        if match:
            result.update(match.groupdict())
    return result

This gives {'long': 'AAAaoasgosaegnsBBB', 'short': 'AAA'} for your givens. If one of the named groups doesn't match at all, it will be absent from the dict.
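As a quick check of the "absent from the dict" behaviour, here is a sketch that repeats the function and adds a third group. The group name missing and its ZZZ pattern are hypothetical, chosen only because they cannot match the sample text:

```python
import re

def nameToMatch(pattern, string):
    result = dict()
    for subpattern in re.split(r'\|(?=\(\?P<)', pattern):
        match = re.search(subpattern, string)
        if match:
            result.update(match.groupdict())
    return result

text = 'sgasgAAAaoasgosaegnsBBBausgisego'
# "missing" is a made-up group that never matches this text.
pattern = '(?P<short>(?:AAA))|(?P<long>(?:AAA.*BBB))|(?P<missing>(?:ZZZ))'

result = nameToMatch(pattern, text)
print(result)  # {'short': 'AAA', 'long': 'AAAaoasgosaegnsBBB'}
```

If you would rather have every group name present, you could pre-fill the dict with None values before the loop, but that is a design choice rather than something the approach requires.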

Nathan Vērzemnieks answered Oct 23 '22 20:10


There didn't seem to be an obvious answer, so here's a hack. It needs a bit of finessing but basically it splits the original regex into its component parts and runs each group regex separately on the original text.

import re

myTextStr = 'sgasgAAAaoasgosaegnsBBBausgisego'
myRegexStr = '(?P<short>(?:AAA))|(?P<long>(?:AAA.*BBB))'
myRegex = re.compile(myRegexStr)   # This is actually no longer needed

print("Full regex with multiple groups")
print(myRegexStr)

# Use a regex to split the original regex into separate regexes
# based on group names
mySplitGroupsRegexStr = r'\(\?P<(\w+)>(\([\w\W]+?\))\)(?:\||\Z)'
mySplitGroupsRegex = re.compile(mySplitGroupsRegexStr)
mySepRegexesList = re.findall(mySplitGroupsRegex,myRegexStr)

print("\nList of separate regexes")
print(mySepRegexesList)

# Convert separate regexes to a dict with group name as key
# and regex as value
mySepRegexDict = {reg[0]:reg[1] for reg in mySepRegexesList}
print("\nDictionary of separate regexes with group names as keys")
print(mySepRegexDict)

# Step through each key and run the group regex on the original text.
# Results are stored in a dictionary with group name as key and
# extracted text as value.
myGroupRegexOutput = {}
for g, r in mySepRegexDict.items():
    m = re.findall(r, myTextStr)
    if m:  # skip groups with no match, otherwise m[0] raises IndexError
        myGroupRegexOutput[g] = m[0]

print("\nOutput of overlapping named group regexes")
print(myGroupRegexOutput)

The resulting output is:

Full regex with multiple groups
(?P<short>(?:AAA))|(?P<long>(?:AAA.*BBB))

List of separate regexes
[('short', '(?:AAA)'), ('long', '(?:AAA.*BBB)')]

Dictionary of separate regexes with group names as keys
{'short': '(?:AAA)', 'long': '(?:AAA.*BBB)'}

Output of overlapping named group regexes
{'short': 'AAA', 'long': 'AAAaoasgosaegnsBBB'}

This might be useful to someone somewhere.

user1718097 answered Oct 23 '22 18:10