re.findall() isn't as greedy as expected - Python 2.7

I am attempting to pull a list of complete sentences out of a body of plaintext using a regular expression in Python 2.7. For my purposes, the list does not need to contain everything that could be construed as a complete sentence, but everything in the list does need to be a complete sentence. The code below illustrates the issue:

import re
text = "Hello World! This is your captain speaking."
sentences = re.findall(r"[A-Z]\w+(\s+\w+[,;:-]?)*[.!?]", text)
print sentences

Per this regex tester, I should, in theory, be getting a list like this:

>>> ["Hello World!", "This is your captain speaking."]

But the output I am actually getting is like this:

>>> [' World', ' speaking']

The documentation indicates that findall() searches from left to right and that the * and + operators are greedy. I would appreciate any help.

Lee Richards asked Mar 09 '23 04:03


1 Answer

The issue is that findall() returns just the captured subgroup rather than the full match. Per the docs for re.findall():

If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
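
To make the quoted rule concrete, here is a small sketch of the three cases (no group, one group, two groups). The sample string and variable names are made up for illustration and are not from the question:

```python
import re

# Hypothetical sample input, chosen only to keep the patterns short.
sample = "a1 b2 c3"

# No groups in the pattern: findall() returns the full matches.
no_groups = re.findall(r"[a-z]\d", sample)

# One capturing group: findall() returns only what that group captured.
one_group = re.findall(r"[a-z](\d)", sample)

# Two capturing groups: findall() returns one tuple of captures per match.
two_groups = re.findall(r"([a-z])(\d)", sample)

print(no_groups)   # ['a1', 'b2', 'c3']
print(one_group)   # ['1', '2', '3']
print(two_groups)  # [('a', '1'), ('b', '2'), ('c', '3')]
```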

It is easy to see what is going on using re.finditer() and exploring the match objects:

>>> import re
>>> text = "Hello World! This is your captain speaking."

>>> it = re.finditer(r"[A-Z]\w+(\s+\w+[,;:-]?)*[.!?]", text)

>>> mo = next(it)
>>> mo.group(0)
'Hello World!'
>>> mo.groups()
(' World',)

>>> mo = next(it)
>>> mo.group(0)
'This is your captain speaking.'
>>> mo.groups()
(' speaking',)
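
Note why only the last word shows up in groups(): when a capturing group sits under a * quantifier, the match object keeps only the text from the group's final repetition. A minimal sketch of that behavior, reusing the pattern from the question on the second sentence alone (the variable names are mine):

```python
import re

text = "This is your captain speaking."
pattern = r"[A-Z]\w+(\s+\w+[,;:-]?)*[.!?]"

mo = re.match(pattern, text)

# group(0) is the whole match, regardless of any capturing groups.
full_match = mo.group(0)

# group(1) matched " is", " your", " captain" and " speaking" in turn,
# but only the last repetition is retained.
last_repetition = mo.group(1)

print(full_match)       # This is your captain speaking.
print(last_repetition)  # prints " speaking" (with the leading space)
```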

The solution to your problem is to make the group non-capturing by writing it as (?:...). Then you get the expected results:

>>> re.findall(r"[A-Z]\w+(?:\s+\w+[,;:-]?)*[.!?]", text)
['Hello World!', 'This is your captain speaking.']
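
If you would rather keep the capturing group (for example because you need it elsewhere), one possible alternative is to collect group(0) from re.finditer() yourself. This is only a sketch of that option using the original pattern from the question; the variable names are mine:

```python
import re

text = "Hello World! This is your captain speaking."

# Original pattern from the question, capturing group left in place.
pattern = r"[A-Z]\w+(\s+\w+[,;:-]?)*[.!?]"

# group(0) always refers to the entire match, so the capturing group
# no longer affects what ends up in the list.
sentences = [mo.group(0) for mo in re.finditer(pattern, text)]

print(sentences)  # ['Hello World!', 'This is your captain speaking.']
```

The non-capturing version is shorter, but this form can be handy when the pattern has groups you still want to inspect on each match object.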

Raymond Hettinger answered Mar 19 '23 22:03