
Regex doesn't stop evaluating after matching with first rule with OR operator

Tags: python, regex

I am having issues with regex matching in Python. I have a string as follows:

test_str = ("ICD : 12123575.007787. 098.3,\n"
    "193235.1, 132534.0, 17707.1,1777029, V40‚0, 5612356,9899\n")

My regular expression has two main groups joined with |:

regex = r"((?<=ICD\s:\s).*\n.*)|((?<=ICD\s).*)"

Let's call them (A | B), where A = ((?<=ICD\s:\s).*\n.*) and B = ((?<=ICD\s).*). According to the documentation, | works such that if A matches, B is not tried.

My problem is that when I apply this regular expression to test_str, it matches B but not A. If I search with A alone (i.e. ((?<=ICD\s:\s).*\n.*)), test_str does match A. So why doesn't A|B match group A and stop there? Here is my Python code:

import re

regex = r"((?<=ICD\s:\s).*\n.*)|((?<=ICD\s).*)"

test_str = ("ICD : 12123575.007787. 098.3,\n"
    "193235.1, 132534.0, 17707.1,1777029, V40‚0, 5612356,9899\n")

matches = re.search(regex, test_str)
if matches:
    print ("Match was found at {start}-{end}: {match}".format(
        start = matches.start(), 
        end = matches.end(), 
        match = matches.group()))

    # Group numbers start at 1; group 0 is the whole match printed above.
    for groupNum in range(1, len(matches.groups()) + 1):
        print ("Group {groupNum} found at {start}-{end}: {group}".format(
            groupNum = groupNum,
            start = matches.start(groupNum),
            end = matches.end(groupNum),
            group = matches.group(groupNum)))

output:

Match was found at 4-29: : 12123575.007787. 098.3,
Group 1 found at -1--1: None
Group 2 found at 4-29: : 12123575.007787. 098.3,


Sorry if this is hard to follow. I don't understand why group 1 is not matched (Group 1 found at -1--1: None). What could be the reason?

asked Aug 21 '17 15:08 by Seeker




1 Answer

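This happens because the regex engine searches for a match from left to right through the string, trying every alternative at one position before moving on to the next position, and the right half of the regex can match at an earlier position. The ordering of | only decides which alternative wins when both could match at the same position. Here the left expression has a longer lookbehind: (?<=ICD\s:\s) requires two more characters than (?<=ICD\s).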

test_str = "ICD : 12123575.007787. 098.3,\n"
#                 ^ left half of the regex matches here
#               ^ right half of the regex matches here
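
As a rough check of the positions marked above, the two halves can be searched separately and compared with the combined pattern. This is only a sketch: the second line of the string is shortened from the question, and the names left, right and combined are mine.

```python
import re

# First line as in the question; second line shortened for brevity.
test_str = "ICD : 12123575.007787. 098.3,\n193235.1, 132534.0\n"

left = re.compile(r"(?<=ICD\s:\s).*\n.*")   # alternative A on its own
right = re.compile(r"(?<=ICD\s).*")         # alternative B on its own

left_start = left.search(test_str).start()
right_start = right.search(test_str).start()

# The combined pattern succeeds at the earliest position where either half matches.
combined = re.search(r"((?<=ICD\s:\s).*\n.*)|((?<=ICD\s).*)", test_str)

print(left_start)          # 6
print(right_start)         # 4
print(combined.start())    # 4
print(combined.lastindex)  # 2, i.e. the right-hand group matched
```

The combined search stops at index 4 because the right half already succeeds there, so index 6 is never tried.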

To put it another way, your regexes are essentially like (?<=.{3}) and (?<=.). If you tried re.search(r'(?<=.{3})|(?<=.)', some_text), it's clear that the right side of the regex would match first, because its lookbehind is shorter.
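
The simplified analogy above can be tried directly. In this sketch, some_text is just a made-up sample string.

```python
import re

some_text = "abcdef"  # hypothetical sample text

# Both alternatives produce empty matches; only the start position differs.
both = re.search(r"(?<=.{3})|(?<=.)", some_text)
long_only = re.search(r"(?<=.{3})", some_text)
short_only = re.search(r"(?<=.)", some_text)

print(both.start())        # 1: the shorter lookbehind is satisfied first
print(long_only.start())   # 3
print(short_only.start())  # 1
```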


You can fix this by adding a negative lookahead that prevents the right half of the regex from matching too early:

regex = r"((?<=ICD\s:\s).*\n.*)|((?<=ICD\s)(?!:\s).*)"
#                                          ^^^^^^^

test_str = "ICD : 12123575.007787. 098.3,\n"
#                 ^ left half of the regex matches here
#          right half of the regex doesn't match at all
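
A quick way to check the fixed pattern is to run it on one input with the colon and one without. In this sketch the second line of with_colon is shortened from the question, and without_colon is an input I made up to exercise the right half.

```python
import re

fixed = r"((?<=ICD\s:\s).*\n.*)|((?<=ICD\s)(?!:\s).*)"

with_colon = "ICD : 12123575.007787. 098.3,\n193235.1, 132534.0\n"
without_colon = "ICD 12123575.007787\n"  # hypothetical input without ": "

m1 = re.search(fixed, with_colon)
m2 = re.search(fixed, without_colon)

print(m1.start(), m1.lastindex)  # 6 1 -> left group matched
print(m2.start(), m2.lastindex)  # 4 2 -> right group matched
```

With the lookahead in place, the right half is rejected at index 4 when ": " follows, so the search continues to index 6, where the left half matches.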
answered Oct 23 '22 13:10 by Aran-Fey