Split string into sentences using regex

I have random text stored in $sentences. Using regex, I want to split the text into sentences, see:

    function splitSentences($text) {
        $re = '/                # Split sentences on whitespace between them.
            (?<=                # Begin positive lookbehind.
              [.!?]             # Either an end of sentence punct,
            | [.!?][\'"]        # or end of sentence punct and quote.
            )                   # End positive lookbehind.
            (?<!                # Begin negative lookbehind.
              Mr\.              # Skip either "Mr."
            | Mrs\.             # or "Mrs.",
            | T\.V\.A\.         # or "T.V.A.",
                                # or... (you get the idea).
            )                   # End negative lookbehind.
            \s+                 # Split on whitespace between sentences.
            /ix';

        $sentences = preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY);
        return $sentences;
    }

    $sentences = splitSentences($sentences);

    print_r($sentences);

It works fine.

However, it doesn't split into sentences if there are unicode characters:

$sentences = 'Entertainment media properties. Fairy Tail and Tokyo Ghoul.'; 

Or this scenario:

$sentences = "Entertainment media properties.&Acirc;&nbsp; Fairy Tail and Tokyo Ghoul."; 

What can I do to make it work when unicode characters exist in the text?

Here is an ideone for testing.

Bounty info

I am looking for a complete solution to this. Before posting an answer, please read the comment thread I had with WiktorStribiżew for more relevant info on this issue.

Henrik Petterson asked Jan 19 '16 16:01

2 Answers

As you might expect, natural language processing of any kind is not a trivial task. The reason is that natural languages are evolutionary systems: no single person sat down and decided which ideas were good and which were not, so every rule has 20-40% exceptions. A single regex that could do your bidding would be unmanageably complex. Still, the following solution relies mainly on regexes.


  • The idea is to gradually go over the text.
  • At any given time, the current chunk of the text is held in two parts: before, the candidate substring preceding a sentence boundary, and after, the candidate substring following it.
  • The first 10 regex pairs detect positions which look like sentence boundaries, but actually aren't. In that case, before and after are advanced without registering a new sentence.
  • If none of these pairs matches, matching will be attempted with the last 3 pairs, possibly detecting a boundary.
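The steps above can be illustrated with a much smaller rule set. The following is only a sketch: the function name toySentenceSplit, the $toyRules array and its three regex pairs are assumptions made up for illustration, not the rules used in the full solution, and the input is assumed to be valid UTF-8.

```php
<?php
// Toy rules, checked in order: [before-regex, after-regex, is-boundary].
// These three pairs are illustrative assumptions, not the answer's rule set.
$toyRules = [
    ['/\b(?:Mr|Mrs)\.\s\z/u', '/\A/u', false],      // title: looks like a boundary, but is not
    ['/[.!?]\s\z/u', '/\A[\p{Ll}\p{N}]/u', false],  // lowercase letter or digit follows
    ['/[.!?]\s\z/u', '/\A/u', true],                // anything else: a boundary
];

function toySentenceSplit(string $text, array $rules, int $window = 10): array
{
    // Split into whole characters so multi-byte UTF-8 sequences are never cut.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $total = count($chars);
    $sentences = [];
    $before = '';    // everything consumed so far
    $sentence = '';  // the sentence currently being collected

    for ($i = 0; $i < $total; $i++) {
        $before .= $chars[$i];
        $sentence .= $chars[$i];
        // Up to $window characters following the candidate position.
        $after = implode('', array_slice($chars, $i + 1, $window));
        if ($after === '') {
            break; // nothing follows, so there is no boundary to test
        }
        foreach ($rules as [$beforeRe, $afterRe, $isBoundary]) {
            if (preg_match($beforeRe, $before) && preg_match($afterRe, $after)) {
                if ($isBoundary) {
                    $sentences[] = rtrim($sentence);
                    $sentence = '';
                }
                break; // the first matching pair decides
            }
        }
    }
    if ($sentence !== '') {
        $sentences[] = $sentence;
    }
    return $sentences;
}

$toyResult = toySentenceSplit('Mr. Smith paid 3.5 dollars. He left.', $toyRules);
print_r($toyResult);
```

The ordering matters: pairs that veto a boundary come first, so "Mr. " is rejected before the generic punctuation-plus-whitespace pair gets a chance to accept it.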

As for where these regexes came from: I translated this Ruby library, which was generated from this paper. If you truly want to understand them, there is no alternative but to read the paper.

As for accuracy, I encourage you to test it with different texts. After some experimentation, I was very pleasantly surprised.

In terms of performance, the regexes should be fast: every one of them has either a \A or a \Z anchor, there are almost no repetition quantifiers, and where there are, no backtracking can occur. Still, regexes are regexes. You will have to do some benchmarking if you plan to use this in tight loops on huge chunks of text.
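A rough way to do that benchmarking is sketched below. The helper name benchmarkSplitter and the stand-in $naive splitter are assumptions for illustration; in practice you would pass the splitter you actually intend to use and a sample of your real text.

```php
<?php
// Hypothetical helper: average wall-clock seconds per call of $splitter on $text.
function benchmarkSplitter(callable $splitter, string $text, int $runs = 20): float
{
    $start = microtime(true);
    for ($i = 0; $i < $runs; $i++) {
        $splitter($text);
    }
    return (microtime(true) - $start) / $runs;
}

// Stand-in splitter so the sketch runs on its own; swap in the real one.
$naive = function (string $text): array {
    return preg_split('/(?<=[.!?])\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
};

// Synthetic input: the same short sentence repeated 2000 times.
$sample = str_repeat('Fairy Tail 3.5 and Tokyo Ghoul. ', 2000);
$seconds = benchmarkSplitter($naive, $sample);
printf("%.6f s per call on %d bytes\n", $seconds, strlen($sample));
```

Timing several input sizes (for example 1x, 10x and 100x the sample) shows whether the cost grows roughly linearly with the text length or faster.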


Mandatory disclaimer: excuse my rusty PHP skills. The following code might not be the most idiomatic PHP ever, but it should still be clear enough to get the point across.


    function sentence_split($text) {
        // Rule $i pairs $before_regexes[$i], which must match at the end of the text
        // before the candidate position, with $after_regexes[$i], which must match at
        // the start of the text after it.
        $before_regexes = array(
            '/(?:(?:[\'\"„][\.!?…][\'\"”]\s)|(?:[^\.]\s[A-Z]\.\s)|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd)\.\s)|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd)\.\s[A-Z]\.\s)|(?:\bApr\.\s)|(?:\bAug\.\s)|(?:\bBros\.\s)|(?:\bCo\.\s)|(?:\bCorp\.\s)|(?:\bDec\.\s)|(?:\bDist\.\s)|(?:\bFeb\.\s)|(?:\bInc\.\s)|(?:\bJan\.\s)|(?:\bJul\.\s)|(?:\bJun\.\s)|(?:\bMar\.\s)|(?:\bNov\.\s)|(?:\bOct\.\s)|(?:\bPh\.?D\.\s)|(?:\bSept?\.\s)|(?:\b\p{Lu}\.\p{Lu}\.\s)|(?:\b\p{Lu}\.\s\p{Lu}\.\s)|(?:\bcf\.\s)|(?:\be\.g\.\s)|(?:\besp\.\s)|(?:\bet\b\s\bal\.\s)|(?:\bvs\.\s)|(?:\p{Ps}[!?]+\p{Pe} ))\Z/su',
            '/(?:(?:[\.\s]\p{L}{1,2}\.\s))\Z/su',
            '/(?:(?:[\[\(]*\.\.\.[\]\)]* ))\Z/su',
            '/(?:(?:\b(?:pp|[Vv]iz|i\.?\s*e|[Vvol]|[Rr]col|maj|Lt|[Ff]ig|[Ff]igs|[Vv]iz|[Vv]ols|[Aa]pprox|[Ii]ncl|Pres|[Dd]ept|min|max|[Gg]ovt|lb|ft|c\.?\s*f|vs)\.\s))\Z/su',
            '/(?:(?:\b[Ee]tc\.\s))\Z/su',
            '/(?:(?:[\.!?…]+\p{Pe} )|(?:[\[\(]*…[\]\)]* ))\Z/su',
            '/(?:(?:\b\p{L}\.))\Z/su',
            '/(?:(?:\b\p{L}\.\s))\Z/su',
            '/(?:(?:\b[Ff]igs?\.\s)|(?:\b[nN]o\.\s))\Z/su',
            '/(?:(?:[\"”\']\s*))\Z/su',
            '/(?:(?:[\.!?…][\x{00BB}\x{2019}\x{201D}\x{203A}\"\'\p{Pe}\x{0002}]*\s)|(?:\r?\n))\Z/su',
            '/(?:(?:[\.!?…][\'\"\x{00BB}\x{2019}\x{201D}\x{203A}\p{Pe}\x{0002}]*))\Z/su',
            '/(?:(?:\s\p{L}[\.!?…]\s))\Z/su'
        );
        $after_regexes = array(
            '/\A(?:)/su',
            '/\A(?:[\p{N}\p{Ll}])/su',
            '/\A(?:[^\p{Lu}])/su',
            '/\A(?:[^\p{Lu}]|I)/su',
            '/\A(?:[^\p{Lu}])/su',
            '/\A(?:\p{Ll})/su',
            '/\A(?:\p{L}\.)/su',
            '/\A(?:\p{L}\.\s)/su',
            '/\A(?:\p{N})/su',
            '/\A(?:\s*\p{Ll})/su',
            '/\A(?:)/su',
            '/\A(?:\p{Lu}[^\p{Lu}])/su',
            '/\A(?:\p{Lu}\p{Ll})/su'
        );
        // The first 10 pairs rule a boundary out; the last 3 confirm one.
        $is_sentence_boundary = array(false, false, false, false, false, false, false, false, false, false, true, true, true);
        $count = 13;

        $sentences = array();
        $sentence = '';
        $before = '';
        // mb_substr (mbstring extension) moves by whole characters. Cutting a multi-byte
        // UTF-8 character in half would make the /u patterns fail on that chunk.
        $after = mb_substr($text, 0, 10, 'UTF-8');
        $text = mb_substr($text, 10, null, 'UTF-8');

        // Note: the loop stops once $text is exhausted, so the final 10 characters
        // are not scanned for boundaries.
        while ($text !== '') {
            for ($i = 0; $i < $count; $i++) {
                if (preg_match($before_regexes[$i], $before) && preg_match($after_regexes[$i], $after)) {
                    if ($is_sentence_boundary[$i]) {
                        array_push($sentences, $sentence);
                        $sentence = '';
                    }
                    break; // the first matching pair decides
                }
            }

            // Slide the window one character to the right.
            $first_from_text = mb_substr($text, 0, 1, 'UTF-8');
            $text = mb_substr($text, 1, null, 'UTF-8');
            $first_from_after = mb_substr($after, 0, 1, 'UTF-8');
            $after = mb_substr($after, 1, null, 'UTF-8');
            $before .= $first_from_after;
            $sentence .= $first_from_after;
            $after .= $first_from_text;
        }

        // Flush whatever is left, including texts too short to enter the loop.
        if ($sentence . $after !== '') {
            array_push($sentences, $sentence . $after);
        }

        return $sentences;
    }

    $text = "Mr. Entertainment media properties. Fairy Tail 3.5 and Tokyo Ghoul.";
    print_r(sentence_split($text));

ndnenkov answered Sep 17 '22 15:09



&Acirc;&nbsp; (an Â followed by a non-breaking space) is what it looks like when you print the UTF-8 encoding of the character U+00A0 Non-Breaking Space to a page/console that is being interpreted as Latin-1. So I think you have a non-breaking space between the sentences, not a normal space.

\s can match a non-breaking space too, but you will need to use the /u modifier to tell preg you are sending it a UTF-8-encoded string. Otherwise it, like your print command, will guess Latin-1 and see it as the two characters Â and a non-breaking space.
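As a minimal sketch of that suggestion, here is the pattern from the question, condensed, with the u modifier added. The function names splitSentencesUtf8 and repairNbspMojibake are hypothetical. The second helper is an assumption about the entity-encoded scenario from the question: it treats &Acirc;&nbsp; as a non-breaking space that was mis-decoded as Latin-1 somewhere upstream and replaces it with a plain space.

```php
<?php
// The question's pattern, condensed, with the "u" modifier added.
// The inline regex comments must not contain the delimiter character.
function splitSentencesUtf8(string $text): array
{
    $re = '/
        (?<= [.!?] | [.!?][\'"] )         # after end-of-sentence punctuation, optionally quoted
        (?<! Mr\. | Mrs\. | T\.V\.A\. )   # but not after a listed abbreviation
        \s+                               # whitespace; with the u modifier this includes U+00A0
        /iux';
    return preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY);
}

// Hypothetical cleanup for text that carries the entities &Acirc;&nbsp;.
// Assumption: that pair is a mis-decoded non-breaking space, not real content.
function repairNbspMojibake(string $text): string
{
    $decoded = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // "Â" followed by U+00A0, written as UTF-8 bytes, becomes one plain space.
    return str_replace("\xC3\x82\xC2\xA0", ' ', $decoded);
}

// Scenario 1: a real UTF-8 non-breaking space (bytes C2 A0) between the sentences.
$nbspText = "Entertainment media properties.\xC2\xA0Fairy Tail and Tokyo Ghoul.";
$fromNbsp = splitSentencesUtf8($nbspText);
print_r($fromNbsp);

// Scenario 2: the entity-encoded variant from the question.
$entityText = 'Entertainment media properties.&Acirc;&nbsp; Fairy Tail and Tokyo Ghoul.';
$fromEntities = splitSentencesUtf8(repairNbspMojibake($entityText));
print_r($fromEntities);
```

Both calls should yield the two sentences separately. If the input is not valid UTF-8, preg_split returns false under the u modifier, so the encoding of the stored text is worth checking first.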

bobince answered Sep 21 '22 15:09
