Regex to find all sentences of text?

Tags: python, regex

I have been trying to teach myself regular expressions in Python, and as an exercise I decided to print all the sentences of a text. I have been tinkering with the pattern for the past 3 hours to no avail.

I just tried the following, but couldn't get it to work.

import re

p = open('anan.txt')
process = p.read()
regexMatch = re.findall(r'^[A-Z].+\s+[.!?]$', process, re.I)
print(regexMatch)
p.close()

My input file is like this:

OMG is this a question ! Is this a sentence ? My.
name is.

This prints nothing. But when I remove "My. name is.", it prints "OMG is this a question !" and "Is this a sentence ?" together as a single match, as if it only reads the first line.

What regex would find all the sentences in a text file, even when a sentence carries over to a new line, and read the entire text? Thanks.

sarevok asked Aug 23 '10 15:08



2 Answers

Something like this works:

## pattern: Uppercase, then anything that is not in (.!?), then one of them
>>> pat = re.compile(r'([A-Z][^\.!?]*[\.!?])', re.M)
>>> pat.findall('OMG is this a question ! Is this a sentence ? My. name is.')
['OMG is this a question !', 'Is this a sentence ?', 'My.']

Notice how "name is." is not in the result because it does not start with an uppercase letter.

Your problem comes from the ^ and $ anchors: without the re.MULTILINE flag, they match only at the start and end of the whole text.
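
To make the anchor behaviour concrete, here is a small sketch that runs the pattern from the question against the sample input, with and without the second line. The variable names are mine, and the input strings are typed in rather than read from anan.txt, so treat it as an illustration, not a drop-in fix.

```python
import re

# The pattern from the question, written as a raw string; re.I as in the question.
asker = re.compile(r'^[A-Z].+\s+[.!?]$', re.I)

one_line = 'OMG is this a question ! Is this a sentence ?'
two_lines = 'OMG is this a question ! Is this a sentence ? My.\nname is.'

# One line: ^ and $ pin the match to the whole string, so both
# sentences come back together as a single match.
single = asker.findall(one_line)

# Two lines: . cannot cross the newline and $ only matches at the
# very end of the text, so nothing matches at all.
multi = asker.findall(two_lines)

# Without the anchors, findall picks out each sentence separately;
# the negated class [^.!?] also matches newlines, so sentences may span lines.
unanchored = re.compile(r'[A-Z][^.!?]*[.!?]')
sentences = unanchored.findall(two_lines)

print(single)
print(multi)
print(sentences)
```

The last pattern still skips "name is." because it requires an uppercase first letter.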

Jochen Ritzel answered Sep 25 '22 22:09


There are two issues in your regex:

  1. Your expression is anchored by ^ and $. Without the re.MULTILINE flag these match only at the start and end of the whole string (with it, at the start and end of each line). Since . does not match a newline, your pattern can only match when the entire text is a single line.
  2. You are searching for \s+ before your punctuation character, which specifies one or more whitespace characters. If you don't have whitespace before your punctuation, the expression will not match.
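
The second point can be checked with a short sketch. The sample text and the unanchored patterns below are my own assumptions for illustration (they use a negated character class instead of the .+ from the question); the only difference between the two patterns is \s+ versus \s*.

```python
import re

# Hypothetical sample: the first sentence has no space before its
# punctuation, the second one does.
text = 'No space before the dot. Space before the mark !'

# \s+ demands at least one whitespace character before the punctuation,
# so the first sentence is skipped.
strict = re.findall(r'[A-Z][^.!?]*\s+[.!?]', text)

# \s* makes that whitespace optional, so both sentences are found.
loose = re.findall(r'[A-Z][^.!?]*\s*[.!?]', text)

print(strict)
print(loose)
```

In the loose pattern the \s* is redundant, since the negated class already consumes whitespace; it is kept only to mirror the original expression.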

Daniel Vandersluis answered Sep 23 '22 22:09