How to detect sentence boundaries with OpenNLP and stringi?

I want to break the following string into sentences:

library(NLP) # NLP_0.1-7  
string <- as.String("Mr. Brown comes. He says hello. i give him coffee.")

I want to demonstrate two different ways. The first uses the openNLP package:

library(openNLP) # openNLP_0.2-5  

sentence_token_annotator <- Maxent_Sent_Token_Annotator(language = "en")  
boundaries_sentences <- annotate(string, sentence_token_annotator)  
string[boundaries_sentences]  

[1] "Mr. Brown comes."   "He says hello."     "i give him coffee."  

The second uses the stringi package:

library(stringi) # stringi_0.5-5  

stri_split_boundaries(string, opts_brkiter = stri_opts_brkiter('sentence'))

[[1]]  
 [1] "Mr. "                              "Brown comes. "                    
 [3] "He says hello. i give him coffee."

With this second approach I have to post-process the result to remove the extra spaces, or split the string into sentences all over again. Can I adjust the stringi call to improve the quality of the result?

With big data, openNLP is (very much) slower than stringi.
Is there a way to combine stringi (-> fast) and openNLP (-> quality)?

asked Aug 06 '15 by SRRussel


2 Answers

Text boundary (in this case, sentence boundary) analysis in ICU (and thus in stringi) is governed by the rules described in Unicode UAX #29; see also the ICU User Guide on the topic. There we read:

[The Unicode rules] cannot detect cases such as “...Mr. Jones...”; more sophisticated tailoring would be required to detect such cases.

In other words, this cannot be done without a custom dictionary of non-stop words, which in fact is implemented in openNLP. A few possible scenarios to incorporate stringi for performing this task would therefore include:

  1. Use stri_split_boundaries and then write a function that decides which incorrectly split tokens should be joined back together (see the sketch after this list).
  2. Manually insert non-breaking spaces into the text, e.g. after the dots that follow etc., Mr., i.e., and so on (note that this is in fact required when preparing documents in LaTeX -- otherwise you get too-large spaces between words).
  3. Incorporate a custom non-stop-word list into a regex and apply stri_split_regex.

and so on.
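For illustration, here is a minimal sketch of scenario 1; the helper join_abbrev_splits() and its abbreviation list are made up for this example and would have to be extended for real data:

library(stringi)

s <- "Mr. Brown comes. He says hello. i give him coffee."

# Split on ICU sentence boundaries, then glue a chunk back onto the previous
# one whenever that previous chunk ends in a known abbreviation; finally trim
# the trailing spaces mentioned in the question.
join_abbrev_splits <- function(chunks, abbrev = c("Mr", "Mrs", "Dr", "Prof", "etc")) {
  pattern <- stri_c("\\b(", stri_c(abbrev, collapse = "|"), ")\\.\\s*$")
  out <- character(0)
  for (chunk in chunks) {
    if (length(out) > 0 && stri_detect_regex(out[length(out)], pattern)) {
      out[length(out)] <- stri_c(out[length(out)], chunk)
    } else {
      out <- c(out, chunk)
    }
  }
  stri_trim_both(out)
}

chunks <- stri_split_boundaries(s, opts_brkiter = stri_opts_brkiter('sentence'))[[1]]
join_abbrev_splits(chunks)

This repairs the bogus split after "Mr.", but the lowercase "i give him coffee." stays attached to the preceding sentence, because ICU never split it there in the first place; that case still needs openNLP or a tailored regex such as the one in the other answer.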

answered by gagolews


This may be a viable regex solution:

string <- "Mr. Brown comes. He says hello. i give him coffee."
stringi::stri_split_regex(string, "(?<!\\w\\.\\w.)(?<![A-Z][a-z]\\.)(?<=\\.|\\?|\\!)\\s")

## [[1]]
## [1] "Mr. Brown comes."   "He says hello."     "i give him coffee."

It performs less well on strings like the following, where spaced single-letter abbreviations such as "p. m." end in a lowercase letter plus a dot and so are not covered by the lookbehinds:

string <- "Mr. Brown comes! He says hello. i give him coffee.  i will got at 5 p. m. eastern time.  Or somewhere in between"
answered by Tyler Rinker