I have a file that includes a bunch of strings like "size=XXX;". I am trying Python's re module for the first time and am a bit mystified by the following behavior: if I use a pipe for 'or' in a regular expression, I only see that bit of the match returned. E.g.:
>>> import re
>>> myfile = open('testfile.txt', 'r').read()
>>> re.findall('size=50;', myfile)
['size=50;', 'size=50;', 'size=50;', 'size=50;']
>>> re.findall('size=51;', myfile)
['size=51;', 'size=51;', 'size=51;']
>>> re.findall('size=(50|51);', myfile)
['51', '51', '51', '50', '50', '50', '50']
>>> re.findall(r'size=(50|51);', myfile)
['51', '51', '51', '50', '50', '50', '50']
The "size=" part of the match is gone (yet it is certainly used in the search, otherwise there would be more results). What am I doing wrong?
findall(): finding all matches in a string. The re.findall() function returns a list of all non-overlapping matches of a pattern in a string. If the pattern is not found, re.findall() returns an empty list.
The findall() function scans the string from left to right and finds all the matches of the pattern in the string. The result of the findall() function depends on the pattern: if the pattern has no capturing groups, the findall() function returns a list of strings that match the whole pattern.
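If the pattern includes no parentheses, then findall() returns a list of the matched strings, as in the earlier examples. If the pattern includes a single set of capturing parentheses, then findall() returns a list of strings corresponding to that single group. If the pattern includes more than one capturing group, findall() returns a list of tuples, with one entry per group.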
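A minimal sketch of how the return value of re.findall() changes with the number of capturing groups. The sample string `text` and the variable names are made up for illustration; they are not from the question's file.

```python
import re

# Hypothetical sample input, standing in for the file contents.
text = "size=50;size=51;size=52;"

# No capturing groups: each item is the whole match.
no_group = re.findall(r"size=\d+;", text)

# One capturing group: each item is only the text captured by that group.
one_group = re.findall(r"size=(\d+);", text)

# Two capturing groups: each item is a tuple with one entry per group.
two_groups = re.findall(r"(size)=(\d+);", text)

print(no_group)    # ['size=50;', 'size=51;', 'size=52;']
print(one_group)   # ['50', '51', '52']
print(two_groups)  # [('size', '50'), ('size', '51'), ('size', '52')]
```

This is the same effect seen in the question: adding parentheses around `50|51` turns the pattern into the one-group case.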
By contrast, re.match() only checks for a match at the beginning of the string. This means that re.match() can return a match found at the start of the first line, but not matches found on any later line. If the string does not start with a match, it returns None.
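To make the difference concrete, here is a small sketch comparing re.match(), re.search(), and re.findall() on a multi-line string. The sample `text` is an assumption chosen so that the first line does not match.

```python
import re

# Hypothetical multi-line input; the first line is not a "size=" entry.
text = "name=a;\nsize=50;\nsize=51;"

# re.match() only tries the very start of the string, so this is None.
at_start = re.match(r"size=\d+;", text)

# re.search() scans forward and returns the first match anywhere.
anywhere = re.search(r"size=\d+;", text)

# With re.MULTILINE, ^ matches at the start of every line, so findall()
# collects the entries from each line that starts with "size=".
per_line = re.findall(r"^size=\d+;", text, flags=re.MULTILINE)

print(at_start)          # None
print(anywhere.group())  # size=50;
print(per_line)          # ['size=50;', 'size=51;']
```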
The problem you have is that if the regex passed to re.findall contains capturing groups (i.e. portions of the regex that are enclosed in parentheses), then it is the groups that are returned, rather than the whole matched string.
One way to solve this issue is to use non-capturing groups (prefixed with ?:).
>>> import re
>>> s = 'size=50;size=51;'
>>> re.findall('size=(?:50|51);', s)
['size=50;', 'size=51;']
If the regex passed to re.findall does not capture anything, it returns the whole of the matched string.
Although using character classes might be the simplest option in this particular case, non-capturing groups provide a more general solution.
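For comparison, a short sketch of the character-class variant, plus one further option that is not mentioned above: keeping the capturing group and using re.finditer(), which yields match objects so that both the whole match and the group are available. The sample string `s` is made up for illustration.

```python
import re

# Hypothetical sample input.
s = "size=50;size=51;size=52;"

# Character class: 5 followed by 0 or 1. There is no group, so findall()
# returns the whole matches.
char_class = re.findall(r"size=5[01];", s)

# finditer() yields match objects; group(0) is the whole match and
# group(1) is the text captured by the first group.
whole = [m.group(0) for m in re.finditer(r"size=(50|51);", s)]
captured = [m.group(1) for m in re.finditer(r"size=(50|51);", s)]

print(char_class)  # ['size=50;', 'size=51;']
print(whole)       # ['size=50;', 'size=51;']
print(captured)    # ['50', '51']
```

The character class only works here because the two alternatives differ in a single character; for alternatives such as `50|120`, a non-capturing group is the more direct fit.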