Regex: Why do empty strings get included (in a list of tuples) in re.findall()?

Tags:

regex

According to the pattern match here, the matches are 213.239.250.131 and 014.10.26.06.

Yet when I run the generated Python code and print out the value of re.findall(p, test_str), I get:

[('', '', '213.239.250.131'), ('', '', '014.10.26.06')]

I could hack around the list and its tuples to get the values I'm looking for (the IP addresses), but (i) they might not always be in the same position in the tuples and (ii) I'd rather understand what's going on here so I can either tighten up the regex, or extract only IP addresses using Python's own re functionality.

Why do I get this list of tuples, why the apparent empty-string matches, and how do we ensure that only the IP addresses are returned?


asked Jun 11 '15 21:06

Pyderman

2 Answers

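When a pattern contains capturing groups, re.findall returns the contents of the groups rather than the whole match, and a group that did not participate in the match is still returned, as an empty string. You have 3 capturing groups, so every item in the findall result is a 3-tuple.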
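A minimal sketch of this behaviour, using a toy pattern (a)|(b) that is not from the question: each match can fill only one of the two groups, and re.findall reports the other one as an empty string.

```python
import re

# Toy pattern (an illustration, not the pattern from the question):
# two alternatives, each wrapped in its own capturing group.
toy = re.compile(r'(a)|(b)')

# A single match can only take one branch of the alternation, so the
# group in the other branch does not participate and comes back as ''.
pairs = toy.findall("ab")
print(pairs)
```

The same thing happens with the three groups in the question's pattern, only with longer tuples.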

On regex101.com, you can display these non-participating groups by turning them on in Options.

You may tighten up your regex by removing capturing groups:

(?:[a-z0-9]{1,4}:+){3,5}[a-z0-9]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}

Or even (?:[a-z0-9]{1,4}:+){3,5}[a-z0-9]{1,4}|\d{1,3}(?:\.\d{1,3}){3}.

See a regex demo

And since the regex pattern does not contain capturing groups, re.findall will only return matches, not capturing group contents:

import re
p = re.compile(r'(?:[a-z0-9]{1,4}:+){3,5}[a-z0-9]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
test_str = "from mail.example.com (example.com. [213.239.250.131]) by\n mx.google.com with ESMTPS id xc4si15480310lbb.82.2014.10.26.06.16.58 for\n <[email protected]> (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256\n bits=128/128); Sun, 26 Oct 2014 06:16:58 -0700 (PDT)"
print(re.findall(p, test_str))

Output of the online Python demo:

['213.239.250.131', '014.10.26.06']
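If you would rather keep the capturing groups, another option is re.finditer with group(0), which is always the whole match no matter which group participated. This is only a sketch: it assumes the original three-group pattern, and the sample string (including the IPv6-style address) is made up for illustration.

```python
import re

# Original three-group pattern, left unchanged.
grouped = re.compile(
    r'(([a-z0-9]{1,4}:+){3,5}[a-z0-9]{1,4})|(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'
)

# Hypothetical input with one address for each alternative.
sample = "relay [213.239.250.131] via fe80::1:2:3"

# group(0) is the full match, so the position of the address
# inside the group tuple no longer matters.
addresses = [m.group(0) for m in grouped.finditer(sample)]
print(addresses)
```

This also covers the concern that the address might not always sit in the same tuple position, since the tuples are never used.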


answered Sep 30 '22 16:09

Wiktor Stribiżew

These are the capturing groups. When you use alternation (|), findall returns an empty string for every group in the alternative that did not match.

(([a-z0-9]{1,4}:+){3,5}[a-z0-9]{1,4})|(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})

The part before the | has 2 groups, one nested inside the other:
(([a-z0-9]{1,4}:+){3,5}[a-z0-9]{1,4})

and after the | there is the third:
(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})

Put simply, each pair of plain round brackets defines a capturing group, and that group shows up in the result whether or not it took part in the match.
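The group count can be checked directly. The sketch below assumes the pattern above and a shortened, made-up input. On a match object, the groups that did not participate are None; re.findall reports those same groups as empty strings.

```python
import re

pattern = re.compile(
    r'(([a-z0-9]{1,4}:+){3,5}[a-z0-9]{1,4})|(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'
)

# Number of capturing groups, counted by opening round brackets.
print(pattern.groups)

# Shortened, hypothetical input containing only an IPv4 address.
m = pattern.search("[213.239.250.131]")

# Groups 1 and 2 belong to the first alternative and did not participate.
print(m.groups())

# Index of the last group that did participate in the match.
print(m.lastindex)
```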

answered Sep 30 '22 18:09

aschmid00
