
How to return the regex that matches some text?

The answer to the JavaScript regex question "Return the part of the regex that matched" is: "No, because compilation destroys the relationship between the regex text and the matching logic."

But Python preserves Match objects, and m.groups() shows which group(s) produced a match. It should be simple to preserve the regex text of each group as part of a Match object and return it, but there doesn't appear to be a call to do so.

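import re

# Raw string, so \d, \w and \W reach the regex engine as written.
pat = r"(^\d+$)|(^\w+$)|(^\W+$)"
test = ['a', 'c3', '36d', '51', '29.5', '#$%&']
for t in test:
    m = re.search(pat, t)
    # lastindex is the number of the last group that matched
    s = (m.lastindex, m.groups()) if m else ''
    print(str(bool(m)), s)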

This returns:

True (2, (None, 'a', None))
True (2, (None, 'c3', None))
True (2, (None, '36d', None))
True (1, ('51', None, None))
False
True (3, (None, None, '#$%&'))

The compiler obviously knows that there are three groups in this pattern. Is there a way to extract the subpattern in each group in a regex, with something like:

>>> print(m.regex_group_text)

('^\d+$', '^\w+$', '^\W+$')

Yes, it would be possible to write a custom pattern parser, for example to split on '|' for this particular pattern. But it would be far easier and more reliable to use the re compiler's understanding of the text in each group.
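For context, here is a minimal sketch of what the compiled pattern and the match object do expose through documented attributes. The whole pattern text and the group count are available, but none of these attributes carries the text of an individual group. The variable names are only illustrative.

```python
import re

compiled = re.compile(r"(^\d+$)|(^\w+$)|(^\W+$)")

print(compiled.pattern)           # the full pattern text, exactly as passed in
print(compiled.groups)            # number of capturing groups in the pattern
print(dict(compiled.groupindex))  # name -> group number; empty without named groups

m = compiled.search("51")
print(m.re is compiled)           # the match keeps a reference to its pattern object
print(m.lastindex)                # number of the last group that matched
```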

Dave asked Mar 11 '16 22:03


1 Answer

If the indices are not sufficient and you absolutely need to know the exact part of the regex, there is probably no other possibility but to parse the expression's groups on your own.

All in all, this is no big deal, since you can simply match up opening and closing parentheses and record their indices:

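def locateBraces(inp):
    """Return (open, close) index pairs for every parenthesis pair in inp."""
    bracePositions = []
    braceStack = []  # indices of '(' that are still unclosed
    for i, ch in enumerate(inp):
        if ch == '(':
            braceStack.append(i)
        elif ch == ')':
            # Check before popping; popping an empty list raises IndexError.
            if not braceStack:
                raise SyntaxError('Too many closing braces.')
            bracePositions.append((braceStack.pop(), i))
    if braceStack:
        raise SyntaxError('Too many opening braces.')
    # Sort by opening index so positions line up with regex group numbers.
    return sorted(bracePositions)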

Edited: This naive implementation only counts opening and closing parentheses. However, regexes may contain escaped parentheses, e.g. \(, which this method wrongly counts as group delimiters. The same goes for parentheses inside character classes and for non-capturing groups such as (?:...), which do not get a group number. You may want to adapt it to skip parentheses that have an odd number of backslashes right before them. I leave this issue as a task for you ;)

With this function, your example becomes:

import re

pat = r"(^\d+$)|(^\w+$)|(^\W+$)"
bloc = locateBraces(pat)

test = ['a', 'c3', '36d', '51', '29.5', '#$%&']
for t in test:
    m = re.search(pat, t)
    if m:
        # lastindex is 1-based; bloc is a 0-based list of (open, close) pairs
        start, end = bloc[m.lastindex - 1]
        print(bool(m), pat[start:end + 1])
    else:
        print(bool(m))

Which returns:

True (^\w+$)
True (^\w+$)
True (^\w+$)
True (^\d+$)
False
True (^\W+$)

Edited: To get the list of your groups, of course a simple comprehension would do:

gtxt = [pat[b[0]:b[1] + 1] for b in bloc]
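
The escaped-parenthesis caveat mentioned earlier can be handled with a slightly smarter scanner. The following is only a sketch under stated assumptions: locate_groups is a hypothetical helper, not part of the re module, and it assumes the pattern is not written for re.VERBOSE and contains no (?#...) comments. It skips escaped characters, ignores parentheses inside character classes, and records only capturing groups.

```python
import re

def locate_groups(pattern):
    """Return (open, close) index pairs of capturing groups, in group-number order.

    Hypothetical helper. Assumes no re.VERBOSE comments and no (?#...) comments.
    """
    spans = []        # finished (open, close) pairs of capturing groups
    stack = []        # (open index, is_capturing) for groups still open
    in_class = False  # True while scanning inside [...]
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == '\\':
            i += 2    # skip the escaped character, e.g. \( or \]
            continue
        if in_class:
            if ch == ']':
                in_class = False
        elif ch == '[':
            in_class = True
            # A ']' directly after '[' or '[^' is a literal, not the class end.
            if pattern[i + 1:i + 2] == '^':
                i += 1
            if pattern[i + 1:i + 2] == ']':
                i += 1
        elif ch == '(':
            # '(?...' is an extension; of those, only '(?P<name>' captures.
            capturing = (not pattern.startswith('(?', i)
                         or pattern.startswith('(?P<', i))
            stack.append((i, capturing))
        elif ch == ')':
            if not stack:
                raise ValueError('Too many closing parentheses.')
            start, capturing = stack.pop()
            if capturing:
                spans.append((start, i))
        i += 1
    if stack:
        raise ValueError('Too many opening parentheses.')
    # Groups are numbered by their opening parenthesis, so sort by start index.
    return sorted(spans)

demo = r"\((a(?:b)[)(])(c)\)"
demo_groups = [demo[s:e + 1] for s, e in locate_groups(demo)]
print(demo_groups)
```

A quick sanity check for any pattern is to compare len(locate_groups(pattern)) with re.compile(pattern).groups; if the two disagree, the pattern uses syntax this sketch does not cover.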

jbndlr answered Oct 10 '22 03:10