
Regex: Why do empty strings get included (in a list of tuples) in re.findall()?

Tags:

python

regex

According to the pattern match here, the matches are 213.239.250.131 and 014.10.26.06.

Yet when I run the generated Python code and print out the value of re.findall(p, test_str), I get:

[('', '', '213.239.250.131'), ('', '', '014.10.26.06')]

I could hack around the list and its tuples to get the values I'm looking for (the IP addresses), but (i) they might not always be in the same position in the tuples and (ii) I'd rather understand what's going on here so I can either tighten up the regex or extract only the IP addresses using Python's own re functionality.

Why do I get this list of tuples, why the apparently empty matches, and how can I ensure that only the IP addresses are returned?

Pyderman asked Jun 11 '15 21:06



2 Answers

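Whenever the pattern contains capturing groups, re.findall returns the group contents rather than the whole match. With more than one group, each match becomes a tuple with one entry per group, and a group that did not participate in the match shows up as an empty string. Your pattern has 3 capturing groups, so every result is a 3-tuple.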

On regex101.com, you can see these non-participating groups by enabling them in Options.

You may tighten up your regex by removing capturing groups:

(?:[a-z0-9]{1,4}:+){3,5}[a-z0-9]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}

Or even (?:[a-z0-9]{1,4}:+){3,5}[a-z0-9]{1,4}|\d{1,3}(?:\.\d{1,3}){3}.

See a regex demo

And since the regex pattern does not contain capturing groups, re.findall will only return matches, not capturing group contents:

import re
p = re.compile(r'(?:[a-z0-9]{1,4}:+){3,5}[a-z0-9]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
test_str = "from mail.example.com (example.com. [213.239.250.131]) by\n mx.google.com with ESMTPS id xc4si15480310lbb.82.2014.10.26.06.16.58 for\n <[email protected]> (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256\n bits=128/128); Sun, 26 Oct 2014 06:16:58 -0700 (PDT)"
print(re.findall(p, test_str))

Output of the online Python demo:

['213.239.250.131', '014.10.26.06']
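If the capturing groups have to stay in the pattern for some other reason, one alternative is re.finditer, which yields match objects whose group(0) is always the whole match regardless of which group participated. The sketch below keeps the original capturing pattern; the sample string is a shortened, hypothetical stand-in for the header in the question.

```python
import re

# Original pattern from the question, capturing groups left in place.
pattern = re.compile(
    r'(([a-z0-9]{1,4}:+){3,5}[a-z0-9]{1,4})|(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'
)

# Shortened, hypothetical sample (not the full header from the question).
sample = "from example.com [213.239.250.131] id lbb.82.2014.10.26.06.16.58"

# group(0) is the entire match, so the position of the group no longer matters.
full_matches = [m.group(0) for m in pattern.finditer(sample)]
print(full_matches)  # ['213.239.250.131', '014.10.26.06']
```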
Wiktor Stribiżew answered Sep 30 '22 16:09


Those are the capturing groups. When the pattern is an alternation (|), findall returns an empty string for every group in the alternative that did not match.

(([a-z0-9]{1,4}:+){3,5}[a-z0-9]{1,4})|(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})

The first alternative contains 2 groups:
(([a-z0-9]{1,4}:+){3,5}[a-z0-9]{1,4})

and after the | comes the third:
(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})

Put simply, each pair of plain parentheses defines a capturing group, and that group shows up in the findall result whether or not it matched.
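The group layout described above can be checked directly in Python. This is a small sketch using a made-up sample string (not the header from the question) that triggers both alternatives, so each tuple shows which groups participated. Taking the first non-empty outer group from each tuple is one possible way to recover the address without depending on its position.

```python
import re

pattern = re.compile(
    r'(([a-z0-9]{1,4}:+){3,5}[a-z0-9]{1,4})|(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'
)
print(pattern.groups)  # 3 capturing groups

# Hypothetical sample containing one match for each alternative.
sample = "addr [213.239.250.131] and fe80:0:0:1"

rows = pattern.findall(sample)
print(rows)
# [('', '', '213.239.250.131'), ('fe80:0:0:1', '0:', '')]

# Group 1 is the whole first alternative, group 3 the whole second one;
# exactly one of them is non-empty for any given match.
addresses = [g1 or g3 for g1, _inner, g3 in rows]
print(addresses)  # ['213.239.250.131', 'fe80:0:0:1']
```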

aschmid00 answered Sep 30 '22 18:09