I am having issues with regex matching in Python. I have a string as follows:
test_str = ("ICD : 12123575.007787. 098.3,\n"
"193235.1, 132534.0, 17707.1,1777029, V40‚0, 5612356,9899\n")
My regular expression has two main groups joined with |
and it is as follows:
regex = r"((?<=ICD\s:\s).*\n.*)|((?<=ICD\s).*)"
Let's call them (A | B), where A = ((?<=ICD\s:\s).*\n.*) and B = ((?<=ICD\s).*). According to the documentation, | works in such a way that if A matches, B is not tried.
Now my problem is that when I apply the above regular expression to test_str, it matches B but not A. But if I search with regular expression A only (i.e. ((?<=ICD\s:\s).*\n.*)), then test_str is matched by A. So my question is: why does A|B not match with group A and stop there? Following is my Python code:
import re

regex = r"((?<=ICD\s:\s).*\n.*)|((?<=ICD\s).*)"
test_str = ("ICD : 12123575.007787. 098.3,\n"
            "193235.1, 132534.0, 17707.1,1777029, V40‚0, 5612356,9899\n")

matches = re.search(regex, test_str)
if matches:
    # Report the overall match (group 0).
    print("Match was found at {start}-{end}: {match}".format(
        start=matches.start(),
        end=matches.end(),
        match=matches.group()))
    # Report each capture group; a group that did not take part gives -1 and None.
    for groupNum in range(1, len(matches.groups()) + 1):
        print("Group {groupNum} found at {start}-{end}: {group}".format(
            groupNum=groupNum,
            start=matches.start(groupNum),
            end=matches.end(groupNum),
            group=matches.group(groupNum)))
output:
Match was found at 4-29: : 12123575.007787. 098.3,
Group 1 found at -1--1: None
Group 2 found at 4-29: : 12123575.007787. 098.3,
Sorry if this is hard to follow. I don't understand why group 1 is not matched (Group 1 found at -1--1: None). Please let me know what the reason could be.
You make a quantifier non-greedy by writing ".*?". With that construct, at every step before consuming another character with ".", the regex engine first attempts to match whatever may come after the ".*?". This means that if, for instance, nothing comes after the ".*?", it matches the empty string.
Escape Sequences (\char): To match a character that has a special meaning in regex, you need to use an escape sequence, i.e. prefix it with a backslash ( \ ). E.g., \. matches "." ; regex \+ matches "+" ; and regex \( matches "(" .
The standard quantifiers in regular expressions are greedy, meaning they match as much as they can, only giving back as necessary to match the remainder of the regex. By using a lazy quantifier, the expression tries the minimal match first.
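As a small illustration of the greedy versus lazy behaviour described above, here is a minimal sketch. The sample string "<a><b>" is made up for this example and has nothing to do with the ICD data in the question.

```python
import re

# Hypothetical sample text, chosen only to make the difference visible.
text = "<a><b>"

# Greedy: .* grabs as much as it can, then gives back only what is
# needed for the final ">" to match, so it ends at the last ">".
greedy = re.search(r"<.*>", text).group()

# Lazy: .*? tries the shortest match first and grows one character
# at a time, so it ends at the first ">".
lazy = re.search(r"<.*?>", text).group()

print(greedy)  # <a><b>
print(lazy)    # <a>
```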
The (?!n) construct is a negative lookahead: it matches at any position that is not followed by the string n.
The reason this happens is that regex searches for a match from left to right, and the right half of the regex matches earlier. This is because the left expression has a longer lookbehind: (?<=ICD\s:\s) requires two more characters than (?<=ICD\s).
test_str = "ICD : 12123575.007787. 098.3,\n"
#                 ^ left half of the regex matches here
#               ^ right half of the regex matches here
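To check those positions, one can search with each half on its own and compare the start offsets. This is only a sketch using the string from the question; the variable names left, right and combined are mine.

```python
import re

test_str = ("ICD : 12123575.007787. 098.3,\n"
            "193235.1, 132534.0, 17707.1,1777029, V40‚0, 5612356,9899\n")

# Each half of the alternation, searched on its own.
left = re.search(r"(?<=ICD\s:\s).*\n.*", test_str)
right = re.search(r"(?<=ICD\s).*", test_str)

print(left.start())   # 6: first position preceded by "ICD : "
print(right.start())  # 4: first position preceded by "ICD "

# The combined pattern stops at the leftmost position where any
# alternative succeeds, which is index 4, where only the right half fits.
combined = re.search(r"((?<=ICD\s:\s).*\n.*)|((?<=ICD\s).*)", test_str)
print(combined.start(), combined.lastindex)  # 4 2
```

The | operator only prefers the left alternative among candidates at the same starting position; it does not look ahead in the string for a place where the left alternative would succeed.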
To put it another way, your regexes are essentially like (?<=.{3}) and (?<=.). If you tried re.search(r'(?<=.{3})|(?<=.)', some_text), it's clear that the right side of the regex would match first, because its lookbehind is shorter.
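The simplified analogy can be tried directly. In this sketch, some_text is an arbitrary placeholder string, and I wrapped each lookbehind in a capture group so that lastindex shows which side matched.

```python
import re

# Arbitrary placeholder input for the analogy.
some_text = "abcdef"

# Same shape as the original pattern: long lookbehind | short lookbehind.
m = re.search(r"((?<=.{3}))|((?<=.))", some_text)

print(m.start())    # 1: the first position with at least one character before it
print(m.lastindex)  # 2: the right-hand group is the one that matched

# On its own, the left side cannot match before index 3.
print(re.search(r"(?<=.{3})", some_text).start())  # 3
```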
You can fix this by preventing the right half of the regex from matching too early by adding a negative lookahead:
regex = r"((?<=ICD\s:\s).*\n.*)|((?<=ICD\s)(?!:\s).*)"
#                                          ^^^^^^^
test_str = "ICD : 12123575.007787. 098.3,\n"
#                 ^ left half of the regex matches here
#                   right half of the regex doesn't match at all
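A quick way to confirm the fix is to run the corrected pattern against the string from the question and inspect both groups. This is a sketch; the second input, "ICD 12345\n", is a hypothetical colon-free line I made up to show that the right half still serves as a fallback.

```python
import re

test_str = ("ICD : 12123575.007787. 098.3,\n"
            "193235.1, 132534.0, 17707.1,1777029, V40‚0, 5612356,9899\n")

# Corrected pattern: the (?!:\s) lookahead stops the right half from
# matching at index 4, where the text continues with ": ".
fixed = r"((?<=ICD\s:\s).*\n.*)|((?<=ICD\s)(?!:\s).*)"

m = re.search(fixed, test_str)
print(m.start(1), m.end(1))  # group 1 now starts at index 6
print(m.group(2))            # None: the right half took no part

# Hypothetical input without a colon: the right half matches instead.
m2 = re.search(fixed, "ICD 12345\n")
print(m2.group(1), m2.group(2))  # None 12345
```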