
Regular expression for counting sentences in a block of text [duplicate]

Tags: regex, php, nlp

Possible Duplicate:
PHP - How to split a paragraph into sentences.

I have a block of text that I would like to separate into sentences. What would be the best way of doing this? I thought of looking for '.', '!', and '?' characters, but I realized there are some problems with this, such as when people use acronyms or end a sentence with something like "!?". What would be the best way to handle these cases? I figured there would be some regex that could handle this, but I'm open to a non-regex solution if that fits the problem better.

GSto asked Sep 09 '10 15:09


2 Answers

Regex isn't the best solution for this problem. You'd be better served by creating a parsing library: something where you can easily create logic blocks to distinguish one thing from another. You'll need to come up with a set of rules for breaking the text up into the chunks you'd like to see. Consider:

"Are you sure?" he asked.

Doesn't that mess things up when using regex? However, with a parser you could actually see

<start quote><capitalization>are you sure<question><end quote>he asked<period>

from which simple rules could conclude "that's one sentence."
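As a rough illustration of the tokenize-then-apply-rules idea, here is a minimal Python sketch. The token kinds (`quote`, `end`, `word`), the function names, and the single rule used (a terminator only ends a sentence if it is at the end of the text or is followed, after an optional closing quote, by a capitalized word or another quote) are assumptions made for this example, not something the answer specifies.

```python
import re

# Hypothetical token kinds for this sketch: quote, end (terminator run), word.
TOKEN_RE = re.compile(r"""
    (?P<quote>")             # a straight double quote
  | (?P<end>[.!?]+)          # a run of terminators, so "!?" is one token
  | (?P<word>[^\s".!?]+)     # anything else up to whitespace or punctuation
""", re.VERBOSE)

def tokenize(text):
    """Return a list of (kind, value) pairs."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]

def count_sentences_tokenized(text):
    """Count sentences using one simple rule over the token stream."""
    tokens = tokenize(text)
    count = 0
    for i, (kind, _) in enumerate(tokens):
        if kind != "end":
            continue
        j = i + 1
        # Skip a quote that directly follows the terminator.
        if j < len(tokens) and tokens[j][0] == "quote":
            j += 1
        if j >= len(tokens):
            count += 1  # terminator at the end of the text
            continue
        next_kind, next_value = tokens[j]
        # A following capitalized word (or another quote) starts a new sentence.
        if next_kind == "quote" or next_value[0].isupper():
            count += 1
    return count

print(count_sentences_tokenized('"Are you sure?" he asked.'))
print(count_sentences_tokenized('He left. "Why?" she asked. Really!? Yes.'))
```

With this rule the quoted question followed by a lowercase "he asked" is treated as one sentence. Abbreviations followed by a capitalized word (for example "Dr. Smith") would still be miscounted, so a real rule set would need more cases.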

wheaties answered Sep 28 '22 05:09


Unfortunately, there is no perfect solution for this, for the very reasons you stated. If you control the content, or can force a specified delimiter after every sentence, that would be ideal. Beyond that, all you can really do is look for (\.|!|\?)+ (the ? has to be escaped, because a bare ? is a quantifier), and maybe require a \s after it, since most people put one or two spaces between sentences.
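A minimal Python sketch of that approach follows. It uses the character class `[.!?]`, which matches the same characters as `(\.|!|\?)`, and also accepts end-of-text in place of trailing whitespace; the function names are made up for this example. The split pattern uses only a lookbehind and `\s+`, so it should carry over to PHP's `preg_split`, though that is not tested here.

```python
import re

# A run of terminators followed by whitespace or the end of the text.
TERMINATOR = re.compile(r"[.!?]+(?:\s+|$)")

def count_sentences_regex(text):
    """Count terminator runs that are followed by whitespace or end of text."""
    return len(TERMINATOR.findall(text))

def split_sentences_regex(text):
    """Split on whitespace that directly follows a terminator."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

print(count_sentences_regex("Is it done!? Yes. Good."))
print(split_sentences_regex("Is it done!? Yes. Good."))
# Known weakness: abbreviations are counted as sentence ends.
print(split_sentences_regex("Dr. Smith arrived."))
```

Requiring whitespace after the terminator keeps things like "3.14" intact, but the last call shows the acronym problem from the question: "Dr." is split off as its own sentence.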

Crayon Violent answered Sep 28 '22 06:09