Possible Duplicate:
PHP - How to split a paragraph into sentences.
I have a block of text that I would like to separate into sentences. What would be the best way of doing this? I thought of looking for '.', '!', and '?' characters, but I realized there are problems with this, such as when people use acronyms or end a sentence with something like "!?". What would be the best way to handle this? I figured there would be some regex that could handle it, but I'm open to a non-regex solution if that fits the problem better.
Regex isn't the best solution for this problem. You'd be better served by writing a small parser, something where you can easily create logic blocks to distinguish one thing from another. You'll need to come up with a set of rules for breaking the text into the chunks you'd like to see. Take this example:
"Are you sure?" he asked.
Doesn't that mess things up when using regex? The question mark sits inside the quotation, yet the sentence carries on after it. With a parser, however, you could actually see
<start quote><capitalization>are you sure<question><end quote>he asked<period>
which simple rules could then recognize as "that's one sentence."
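As a rough illustration of the rule-based idea above, here is a minimal Python sketch. It is not a full parser: it scans whitespace-separated tokens and applies two simple rules. The function name `split_by_rules` and the `ABBREVIATIONS` set are assumptions made up for this example, and the abbreviation list is deliberately tiny.

```python
# Hypothetical abbreviation list for illustration; a real one would be much longer.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "etc."}

def split_by_rules(text):
    """Split text into sentences using a few hand-written rules."""
    tokens = text.split()
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        # Ignore trailing closing quotes/brackets when looking for a terminator.
        core = tok.rstrip("\"')")
        if not core or core[-1] not in ".!?":
            continue
        # Rule 1: a known abbreviation does not end a sentence.
        if core.lower() in ABBREVIATIONS:
            continue
        # Rule 2: if the next word starts in lowercase, the sentence continues
        # (this is what keeps '"Are you sure?" he asked.' in one piece).
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if nxt is not None and nxt.lstrip("\"'(")[:1].islower():
            continue
        sentences.append(" ".join(current))
        current = []
    if current:  # trailing text without a terminator
        sentences.append(" ".join(current))
    return sentences

print(split_by_rules('"Are you sure?" he asked. Dr. Smith nodded!? Yes.'))
```

Each rule is an ordinary `if`, so adding another one (for example, for decimal numbers) does not mean rewriting a pattern. The sketch still gets things wrong, for instance when a sentence ends with an abbreviation, and it collapses runs of whitespace into single spaces.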
Unfortunately, there is no perfect solution for this, for the very reasons you stated. If it is content that you can somehow control, or where you can force a specified delimiter after every sentence, that would be ideal. Beyond that, all you can really do is look for (\.|!|\?)+ (note that the ? has to be escaped, otherwise the pattern is invalid)
and maybe even throw in a \s after that, since most people pad new sentences with one or two spaces between the previous and next sentence.
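A minimal Python sketch of that regex approach follows. It assumes a character class `[.!?]` in place of the escaped alternation (the two match the same characters), and `split_on_terminators` is a name invented for this example. The split happens on whitespace that follows a terminator, so the punctuation stays attached to its sentence.

```python
import re

# Whitespace that is immediately preceded by '.', '!' or '?'.
# The lookbehind keeps the punctuation with the sentence it ends.
SENTENCE_GAP = re.compile(r"(?<=[.!?])\s+")

def split_on_terminators(text):
    """Naive sentence split: break on whitespace after ., ! or ?."""
    return [s for s in SENTENCE_GAP.split(text.strip()) if s]

print(split_on_terminators("Is it done!?  Yes. Good."))
print(split_on_terminators("Dr. Smith left."))  # shows the abbreviation problem
```

The second call splits after "Dr.", which is exactly the weakness described in the question. The same pattern should carry over to PHP's `preg_split`, since PCRE supports this kind of lookbehind.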