According to the pattern match here, the matches are 213.239.250.131 and 014.10.26.06.
Yet when I run the generated Python code and print out the value of re.findall(p, test_str), I get:
[('', '', '213.239.250.131'), ('', '', '014.10.26.06')]
I could hack around the list and its tuples to get the values I'm looking for (the IP addresses), but (i) they might not always be in the same position in the tuples and (ii) I'd rather understand what's going on here so I can either tighten up the regex or extract only IP addresses using Python's own re functionality.
Why do I get this list of tuples, why the apparent whitespace matches, and how do we ensure that only the IP addresses are returned?
The findall() function scans the string from left to right and finds all non-overlapping matches of the pattern. The shape of the result depends on the pattern: if the pattern has no capturing groups, findall() returns a list of strings that match the whole pattern; with exactly one capturing group, it returns a list of that group's text; with several capturing groups, it returns a list of tuples with one item per group.
The re.findall(pattern, string) function scans string from left to right, searching for all non-overlapping matches of the pattern. It returns the results as a list, in the order the matches were found.
findall() finds *all* the matches and returns them as a list of strings, with each string representing one match.
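As a quick illustration of how the result shape follows the number of capturing groups, here is a minimal sketch. The sample text and the three patterns are made up for this example and are not part of the question.

```python
import re

text = "a1 b2 c3"

# No capturing groups: a list of whole-match strings.
no_groups = re.findall(r"[a-z]\d", text)

# Exactly one capturing group: a list of that group's text.
one_group = re.findall(r"[a-z](\d)", text)

# Two or more capturing groups: a list of tuples, one slot per group.
two_groups = re.findall(r"([a-z])(\d)", text)

print(no_groups)   # ['a1', 'b2', 'c3']
print(one_group)   # ['1', '2', '3']
print(two_groups)  # [('a', '1'), ('b', '2'), ('c', '3')]
```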
But finditer and findall return different things, not different matches. Both find every non-overlapping match in the given string. findall returns the matched text (or the capturing group contents) as a list, while finditer returns an iterator of match objects, one per match.
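To compare the two functions on the same input, the sketch below runs both with a small made-up pattern that has two capturing groups. Taking group(0) from each match object gives the whole match, so the position of a group in the tuple no longer matters. The names pattern, text and whole_matches are illustrative only.

```python
import re

# Illustrative pattern with two capturing groups joined by alternation.
pattern = re.compile(r"(\d+)|([a-z]+)")
text = "abc 123 de"

# finditer yields one match object per non-overlapping match;
# group(0) is always the full matched text, whatever groups exist.
whole_matches = [m.group(0) for m in pattern.finditer(text)]

# findall on the same input returns one tuple per match,
# with '' for the group that did not take part.
tuples = pattern.findall(text)

print(whole_matches)  # ['abc', '123', 'de']
print(tuples)         # [('', 'abc'), ('123', ''), ('', 'de')]
```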
Whenever a pattern contains a capturing group, findall returns that group's submatch, even if it is empty because the group did not participate in the match. You have 3 capturing groups, so you will always get all three in the findall result.
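The following sketch reproduces this with the regex from the question on a shortened sample string (the sample is an assumption made for brevity, not the full header from the question). On a match object the non-participating groups are None; findall reports them as empty strings.

```python
import re

# The regex from the question, split over two lines for readability.
original = re.compile(
    r"(([a-z0-9]{1,4}:+){3,5}[a-z0-9]{1,4})"
    r"|(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"
)
sample = "example.com. [213.239.250.131]"

match = original.search(sample)

# Groups 1 and 2 belong to the first alternative, which did not match.
print(match.groups())            # (None, None, '213.239.250.131')

# findall converts the non-participating groups to empty strings.
print(original.findall(sample))  # [('', '', '213.239.250.131')]
```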
On regex101.com, you can see these non-participating groups by turning them on in Options.
You may tighten up your regex by removing capturing groups:
(?:[a-z0-9]{1,4}:+){3,5}[a-z0-9]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}
Or even:
(?:[a-z0-9]{1,4}:+){3,5}[a-z0-9]{1,4}|\d{1,3}(?:\.\d{1,3}){3}
See a regex demo
And since the regex pattern does not contain capturing groups, re.findall will only return matches, not capturing group contents:
import re
p = re.compile(r'(?:[a-z0-9]{1,4}:+){3,5}[a-z0-9]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
test_str = "from mail.example.com (example.com. [213.239.250.131]) by\n mx.google.com with ESMTPS id xc4si15480310lbb.82.2014.10.26.06.16.58 for\n <[email protected]> (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256\n bits=128/128); Sun, 26 Oct 2014 06:16:58 -0700 (PDT)"
print(re.findall(p, test_str))
Output of the online Python demo:
['213.239.250.131', '014.10.26.06']
These are the capturing groups. When the pattern uses alternation (|), findall returns an empty string for every group in the alternative that did not match.
(([a-z0-9]{1,4}:+){3,5}[a-z0-9]{1,4})|(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})
The first alternative has 2 groups: (([a-z0-9]{1,4}:+){3,5}[a-z0-9]{1,4})
and after the | there is the third: (\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})
Put simply, each pair of plain round brackets (without ?:) defines a capturing group, and that group shows up in the result whether or not it matched.
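One way to check how many capturing groups a pattern defines is the groups attribute of a compiled pattern. The sketch below compares the regex from the question with a non-capturing rewrite; the variable names are illustrative only.

```python
import re

# Regex from the question: three pairs of capturing brackets.
capturing = re.compile(
    r"(([a-z0-9]{1,4}:+){3,5}[a-z0-9]{1,4})"
    r"|(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"
)

# Same alternatives, but every group is written as (?:...).
non_capturing = re.compile(
    r"(?:[a-z0-9]{1,4}:+){3,5}[a-z0-9]{1,4}"
    r"|\d{1,3}(?:\.\d{1,3}){3}"
)

print(capturing.groups)      # 3
print(non_capturing.groups)  # 0

# With zero capturing groups, findall returns plain strings.
print(non_capturing.findall("[213.239.250.131]"))  # ['213.239.250.131']
```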