
Regex to match . (periods marking end of sentences) but not Mr. (as in Mr. Hopkins)

I'm trying to parse a text file into sentences that end in periods, but names like Mr. Hopkins produce false matches, because the period after the title is not the end of a sentence.

What regex matches "." but not the period in "Mr."?

As a bonus, I'm also using ! to find the end of sentences, so my current regex is /(!|\.)/ and I'd love an answer that incorporates my !'s too.

Josh Crews asked May 31 '10 21:05


2 Answers

Use a negative lookbehind:

(?<!Mr|Mrs|Dr|Ms)\.

This matches a period only if it is not immediately preceded by Mr, Mrs, Dr or Ms.

<?php
   $str = "This is Mr. Someone and Mrs. Somebody. They are here to meet Dr. SomeoneElse.";
   // Replace every period that is not preceded by a listed title with a newline.
   $str = preg_replace("/(?<!Mr|Mrs|Dr|Ms)\\./", "\n", $str);
   echo($str);
?>
//outputs:
This is Mr. Someone and Mrs. Somebody
 They are here to meet Dr. SomeoneElse
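
The same lookbehind can probably be extended to cover the `!` from the question. The following is only a sketch: the sample text and the list of titles are assumptions, and it swaps `preg_replace` for `preg_split` so that the sentences come back as an array.

```php
<?php
// Sample text (made up for illustration).
$text = "This is Mr. Someone and Mrs. Somebody. They are here! Dr. SomeoneElse is late.";

// (?<!Mr|Mrs|Dr|Ms)\.  a period not preceded by one of the listed titles
// |!                   or any exclamation mark
// \s*                  plus any whitespace that follows the terminator
$pattern = '/(?:(?<!Mr|Mrs|Dr|Ms)\.|!)\s*/';

// PREG_SPLIT_NO_EMPTY drops the empty piece after the final period.
$sentences = preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY);

print_r($sentences);
// Three elements:
//   "This is Mr. Someone and Mrs. Somebody"
//   "They are here"
//   "Dr. SomeoneElse is late"
```

Note that the terminators themselves are consumed by the split. The list of titles is hand-picked, so any abbreviation missing from it will still be treated as a sentence end.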
Amarghosh answered Oct 23 '22 18:10


This can't be done reliably with any simple mechanism. It's hopelessly ambiguous: a sentence can end with an abbreviation, and in that case it isn't written with two periods, so the abbreviation's period is also the sentence's period.
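
To make the ambiguity concrete, here is a small sketch that applies the title-skipping lookbehind from the other answer to a made-up text in which the first sentence really does end with "Dr." (short for "Drive"). The input is hypothetical and chosen only to trigger the problem.

```php
<?php
// Hypothetical input: two sentences, the first one ends with the abbreviation "Dr."
$text = "Our office is on Mulholland Dr. Parking is free.";

// Split on periods that are not preceded by a listed title.
$parts = preg_split('/(?<!Mr|Mrs|Dr|Ms)\.\s*/', $text, -1, PREG_SPLIT_NO_EMPTY);

// The period after "Dr" is skipped, so both sentences come back as one piece.
var_dump(count($parts)); // int(1)
```

No pattern that looks only at the characters around the period can tell this "Dr." apart from the one in "Dr. SomeoneElse".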

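See Unicode TR29 (Unicode Standard Annex #29, "Unicode Text Segmentation"), which defines default sentence-boundary rules. Also see the ICU open source library, which includes a basic implementation.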
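
For PHP, ICU's sentence segmentation is exposed through the `intl` extension as `IntlBreakIterator`. The sketch below assumes that extension is installed; `icu_sentences()` is a hypothetical helper name, and the sample text and the `en_US` locale are placeholders. The default rules may still break after abbreviations such as "Mr.", so the output should be checked against real data.

```php
<?php
// Sketch only: needs PHP's intl extension (a wrapper around ICU).
// icu_sentences() is a hypothetical helper, not part of any library.
function icu_sentences(string $text, string $locale = 'en_US'): array
{
    $iterator = IntlBreakIterator::createSentenceInstance($locale);
    $iterator->setText($text);

    // getPartsIterator() yields the text between consecutive boundaries,
    // including any trailing whitespace of each segment.
    return iterator_to_array($iterator->getPartsIterator(), false);
}

// Only run the demo when the extension is actually available.
if (class_exists('IntlBreakIterator')) {
    print_r(icu_sentences("Stop! This is one sentence. This is another."));
}
```

Because the segments keep their trailing whitespace, joining them with an empty string gives back the original text.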

bmargulies answered Oct 23 '22 18:10