Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Using a regular expression backreference as part of the regular expression in Python

Tags:

python

regex

I am attempting to using part of a regular expression as input for a later part of the regular expression.

What I have so far (which fails the assertions):

import re
regex = re.compile(r"(?P<length>\d+)(\d){(?P=length)}")
assert bool(regex.match("3123")) is True
assert bool(regex.match("100123456789")) is True

Breaking this down, the first digit(s) signify how many digits should be matched afterward. In the first assertion, I get a 3, as the first character, which means there should be exactly three digits after, otherwise there are more than 9 digits. If there are more than 9 digits, then the first group will need to be expanded and checked against the rest of the digits.

The regular expression 3(\d){3} would properly match the first assertion, however I cannot get the regular expression to match the general case where the braces {} are fed a regular expression backreference: {(?P=length)}

Calling the regular expression with the re.DEBUG flag I get:

subpattern 1
  max_repeat 1 4294967295
    in
      category category_digit
subpattern 2
  in
    category category_digit
literal 123
groupref 1
literal 125

It looks like the braces { (123) and } (125) are being interpreted as literals when there is a backreference inside of them.
When there is no backreference, such as {3}, I can see that {3} is being interpreted as max_repeat 3 3

Is using a backreference as part of a regular expression possible?

like image 360
Bryce Guinta Avatar asked Dec 10 '16 22:12

Bryce Guinta


People also ask

What is a regex Backreference?

A backreference in a regular expression identifies a previously matched group and looks for exactly the same text again. A simple example of the use of backreferences is when you wish to look for adjacent, repeated words in some text. The first part of the match could use a pattern that extracts a single word.

How do I reference a capture group in regex Python?

The syntax is \N or \g<N> where N is the capture group you want.

How do you use special characters in regex Python?

To match a character having special meaning in regex, you need to use a escape sequence prefix with a backslash ( \ ). E.g., \. matches "." ; regex \+ matches "+" ; and regex \( matches "(" . You also need to use regex \\ to match "\" (back-slash).


1 Answers

There is no way to put the backreference as a limiting quantifier argument inside the pattern. To solve your current task, I can suggest the following code (see inline comments explaining the logic):

import re
def checkNum(s):
    first = ''
    if s == '0':
        return True # Edge case, 0 is valid input
    m = re.match(r'\d{2,}$', s) # The string must be all digits, at least 2
    if m:
        lim = ''
        for i, n in enumerate(s):
            lim = lim + n
            if re.match(r'{0}\d{{{0}}}$'.format(lim), s):
                return True
            elif int(s[0:i+1]) > len(s[i+1:]):
                return False

print(checkNum('3123'))         # Meets the pattern (123 is 3 digit chunk after 3)
print(checkNum('1234567'))      # Does not meet the pattern, just 7 digits
print(checkNum('100123456789')) # Meets the condition, 10 is followed with 10 digits
print(checkNum('9123456789'))   # Meets the condition, 9 is followed with 9 digits

See the Python demo online.

The pattern used with re.match (that anchors the pattern at the start of the string) is {0}\d{{{0}}}$ that will look like 3\d{3}$ in case the 3123 is passed to the checkNum method. It will match a string that starts with 3, and then will match exactly 3 digits followed with end of string marker ($).

like image 101
Wiktor Stribiżew Avatar answered Oct 27 '22 08:10

Wiktor Stribiżew