Regex doesn't stop evaluating after matching with first rule with OR operator

Q: How do I stop regex from being greedy?

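You make a quantifier non-greedy (lazy) by appending "?" to it, as in ".*?". With the lazy form, each time the regex engine is about to match another character with the ".", it first tries to match whatever comes after the ".*?", and it consumes the character only if that attempt fails. This means that if, for instance, nothing comes after the ".*?", it matches the empty string.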

Q: How do you escape special characters in regex?

Escape sequences (\char): To match a character that has a special meaning in regex, prefix it with a backslash ( \ ). E.g., \. matches "."; \+ matches "+"; and \( matches "(".
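As a minimal sketch of the escaping rule using Python's re module (the sample strings are made up for illustration): an unescaped "." matches any character, "\." matches only a literal dot, and re.escape() adds the backslashes for an arbitrary literal string.

```python
import re

text = "1+1=2 (approx. 2.0)"  # made-up sample text

# Unescaped "." matches any character except a newline.
any_char = re.findall(r"2.0", "2.0 2x0")
# Escaped "\." matches only a literal dot.
literal_dot = re.findall(r"2\.0", "2.0 2x0")

# "\+" and "\(" match the literal characters "+" and "(".
plus = re.search(r"1\+1", text)
paren = re.search(r"\(approx", text)

# re.escape() escapes every special character in a literal string.
escaped = re.escape("1+1")

print(any_char, literal_dot, plus.group(), paren.group())
print(re.search(escaped, text).group())
```

re.escape() is the safer choice when the literal text comes from user input rather than from a pattern you typed yourself.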

Q: What is greedy regex?

The standard quantifiers in regular expressions are greedy, meaning they match as much as they can, only giving back as necessary to match the remainder of the regex. By using a lazy quantifier, the expression tries the minimal match first.
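A short sketch of the difference in Python, assuming a made-up sample string with two tag pairs: the greedy ".*" runs to the last closing tag, while the lazy ".*?" stops at the first one.

```python
import re

html = "<b>one</b> and <b>two</b>"  # made-up sample text

# Greedy: ".*" grabs as much as possible, then backtracks just far enough
# for the trailing "</b>" to match, so the match spans both tag pairs.
greedy = re.search(r"<b>.*</b>", html).group()

# Lazy: ".*?" grabs as little as possible, so it stops at the first "</b>".
lazy = re.search(r"<b>.*?</b>", html).group()

print(greedy)
print(lazy)
```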

Q: What is ?! in regex?

(?!n) is a negative lookahead, not a quantifier. It succeeds at a position only if the text that follows does not match n, and it consumes no characters.
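A small sketch of the (?!...) construct in Python, using a made-up sample string: the pattern matches "foo" only at positions where "bar" does not come next, and the lookahead itself consumes nothing.

```python
import re

words = "foobar foobaz foo"  # made-up sample text

# "foo(?!bar)" matches "foo" only where it is NOT followed by "bar".
# The lookahead is a zero-width check, so each match is just "foo".
matches = list(re.finditer(r"foo(?!bar)", words))
starts = [m.start() for m in matches]
texts = [m.group() for m in matches]

print(starts, texts)
```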

Tags: python, regex

I am having issues with regex matching in Python. I have a string as follows:

test_str = ("ICD : 12123575.007787. 098.3,\n"
    "193235.1, 132534.0, 17707.1,1777029, V40‚0, 5612356,9899\n")

My regular expression has two main groups joined with |, as follows:

regex = r"((?<=ICD\s:\s).*\n.*)|((?<=ICD\s).*)"

Let's call them (A | B), where A = ((?<=ICD\s:\s).*\n.*) and B = ((?<=ICD\s).*). According to the documentation, | works in such a way that if A matches, B is not tried.

My problem is that when I run the regular expression above against test_str, it matches B but not A. But if I search with regular expression A alone (i.e. ((?<=ICD\s:\s).*\n.*)), then test_str matches A. So my question is: why does the A|B regular expression not match with group A and stop there? Here is my Python code:

import re

regex = r"((?<=ICD\s:\s).*\n.*)|((?<=ICD\s).*)"

test_str = ("ICD : 12123575.007787. 098.3,\n"
    "193235.1, 132534.0, 17707.1,1777029, V40‚0, 5612356,9899\n")

matches = re.search(regex, test_str)
if matches:
    print ("Match was found at {start}-{end}: {match}".format(
        start = matches.start(), 
        end = matches.end(), 
        match = matches.group()))

    for groupNum in range(0, len(matches.groups())):
        groupNum = groupNum + 1

        print ("Group {groupNum} found at {start}-{end}: {group}".format(
            groupNum = groupNum, 
            start = matches.start(groupNum), 
            end = matches.end(groupNum), 
            group = matches.group(groupNum)))

output:

Match was found at 4-29: : 12123575.007787. 098.3,
Group 1 found at -1--1: None
Group 2 found at 4-29: : 12123575.007787. 098.3,

Sorry if this is hard to follow. I don't understand why group 1 is not matched (Group 1 found at -1--1: None). Please let me know what the reason could be.

asked Aug 21 '17 15:08

Seeker

1 Answer

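This happens because the regex engine searches for a match from left to right and returns the match that starts earliest in the string. The alternation order only matters when both halves can match at the same position, and here the right half of the regex can match at an earlier position than the left half. That is because the left expression has a longer lookbehind: (?<=ICD\s:\s) requires two more characters than (?<=ICD\s).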

test_str = "ICD : 12123575.007787. 098.3,\n"
#                 ^ left half of the regex matches here
#               ^ right half of the regex matches here

To put it another way, your regexes are essentially like (?<=.{3}) and (?<=.). If you tried re.search(r'(?<=.{3})|(?<=.)', some_text), it's clear that the right side of the regex would match first, because its lookbehind is shorter.
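The following sketch checks this with re.search. It uses the shortened one-line test_str from the answer, and the names A, B, start_a, start_b, combined, and toy are introduced here only for illustration.

```python
import re

test_str = "ICD : 12123575.007787. 098.3,\n"

A = r"((?<=ICD\s:\s).*\n.*)"
B = r"((?<=ICD\s).*)"

# Start index of the first match for each half on its own.
start_a = re.search(A, test_str).start()
start_b = re.search(B, test_str).start()

# With alternation, the engine tries A and then B at each position in turn,
# moving left to right. B succeeds at index 4, before A can match at index 6.
combined = re.search(A + "|" + B, test_str)

# The simplified analogy: the shorter lookbehind is satisfied first.
toy = re.search(r"(?<=.{3})|(?<=.)", "abcdef")

print(start_a, start_b, combined.start(), combined.group(1))
print(toy.start())
```

Group 1 is None in the combined match because the overall match came from the B branch, which is what the question's output shows.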

You can fix this by preventing the right half of the regex from matching too early by adding a negative lookahead:

regex = r"((?<=ICD\s:\s).*\n.*)|((?<=ICD\s)(?!:\s).*)"
#                                          ^^^^^^^

test_str = "ICD : 12123575.007787. 098.3,\n"
#                 ^ left half of the regex matches here
#          right half of the regex doesn't match at all
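A quick way to check the fixed pattern is sketched below. The with_colon input is a shortened version of the question's string, and without_colon is a hypothetical input added here to show that the right half still matches when no " : " follows "ICD".

```python
import re

regex = r"((?<=ICD\s:\s).*\n.*)|((?<=ICD\s)(?!:\s).*)"

with_colon = "ICD : 12123575.007787. 098.3,\n193235.1, 132534.0\n"
without_colon = "ICD 12123575.007787\n"  # hypothetical input with no " : "

# The negative lookahead blocks the right half at index 4,
# so the left half gets to match at index 6 and fills group 1.
m1 = re.search(regex, with_colon)

# With no ": " after "ICD ", the right half matches and fills group 2.
m2 = re.search(regex, without_colon)

print(m1.start(), repr(m1.group(1)), m1.group(2))
print(m2.start(), m2.group(1), repr(m2.group(2)))
```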

answered Oct 23 '22 13:10

Aran-Fey
