Regex to find all sentences of text?

Tags:

python

regex

I have been trying to teach myself regexes in Python, and I decided to print out all the sentences of a text. I have been tinkering with regular expressions for the past 3 hours to no avail.

I just tried the following, but couldn't get it to work.

import re

p = open('anan.txt')
process = p.read()
regexMatch = re.findall(r'^[A-Z].+\s+[.!?]$', process, re.I)
print(regexMatch)
p.close()

My input file is like this:

OMG is this a question ! Is this a sentence ? My.
name is.

This prints nothing. But when I remove "My. name is.", it prints "OMG is this a question" and "Is this a sentence" together, as if it only reads the first line.

What is the best regex for finding all sentences in a text file, even when a sentence carries over to a new line, while also reading the entire text? Thanks.

asked Aug 23 '10 15:08

sarevok

2 Answers

Something like this works:

## pattern: uppercase, then anything that is not in (.!?), then one of them
>>> import re
>>> pat = re.compile(r'([A-Z][^\.!?]*[\.!?])', re.M)
>>> pat.findall('OMG is this a question ! Is this a sentence ? My. name is.')
['OMG is this a question !', 'Is this a sentence ?', 'My.']

Notice how "name is." is not in the result because it does not start with an uppercase letter.
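If lowercase-initial fragments should be kept as well, the uppercase requirement can be dropped. The following is only a sketch under that assumption: the SENTENCE pattern and the find_sentences helper are hypothetical names, not part of the answer above. It also collapses whitespace so that a sentence carried over to a new line prints on one line.

```python
import re

# Hypothetical variant: one or more non-terminator characters followed by
# a single terminator, with no requirement on the first letter.
SENTENCE = re.compile(r'[^.!?]+[.!?]')

def find_sentences(text):
    """Return each match with leading/trailing whitespace removed and
    internal whitespace runs (including newlines) collapsed to one space."""
    return [' '.join(match.split()) for match in SENTENCE.findall(text)]

text = 'OMG is this a question ! Is this a sentence ? My.\nname is.'
print(find_sentences(text))
# ['OMG is this a question !', 'Is this a sentence ?', 'My.', 'name is.']
```

Because the negated character class also matches newlines, no extra flag is needed for sentences that span lines.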

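Your problem comes from the ^ and $ anchors: without the re.M (re.MULTILINE) flag, they match only at the start and end of the whole text, so the pattern has to cover the entire input at once.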
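The effect of the anchors can be reproduced with a short sketch. It assumes the input file contains a newline between "My." and "name is."; the variable names are illustrative only.

```python
import re

pattern = r'^[A-Z].+\s+[.!?]$'  # the pattern from the question

one_line = 'OMG is this a question ! Is this a sentence ?'
two_lines = one_line + ' My.\nname is.'

# Without re.M, ^ and $ anchor to the whole string, and . cannot cross
# the newline, so nothing matches the two-line text.
no_match = re.findall(pattern, two_lines, re.I)
print(no_match)  # []

# On a single line the greedy .+ runs up to the last " ?", so both
# sentences come back together as one match.
single_match = re.findall(pattern, one_line, re.I)
print(single_match)  # ['OMG is this a question ! Is this a sentence ?']
```

This matches the behaviour described in the question: no output with the full input, and both sentences printed together once the last part is removed.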

answered Sep 25 '22 22:09

Jochen Ritzel

There are two issues in your regex:

1. Your expression is anchored by ^ and $, which are the "start of line" and "end of line" anchors, respectively (without the re.M flag, they match only at the start and end of the whole string). That means that your pattern is looking to match an entire line of your text.
2. You are searching for \s+ before your punctuation character, which requires one or more whitespace characters. If you don't have whitespace before your punctuation, the expression will not match.
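The second point can be seen in isolation with a small sketch. The two patterns below are illustrative assumptions, not the asker's final code: both drop the anchors and differ only in whether whitespace before the punctuation is required.

```python
import re

text = 'OMG is this a question ! Is this a sentence ? My.\nname is.'

# \s+ demands at least one whitespace character before the terminator,
# so "My." (no space before the period) is skipped.
required_space = re.findall(r'[A-Z][^.!?]*\s+[.!?]', text)
print(required_space)
# ['OMG is this a question !', 'Is this a sentence ?']

# \s* makes the whitespace optional, so "My." is found as well.
optional_space = re.findall(r'[A-Z][^.!?]*\s*[.!?]', text)
print(optional_space)
# ['OMG is this a question !', 'Is this a sentence ?', 'My.']
```

In the second pattern the \s* is redundant, because the negated character class already matches whitespace; it is kept here only to make the contrast with \s+ visible.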

answered Sep 23 '22 22:09

Daniel Vandersluis
