How to detect sentence boundaries with OpenNLP and stringi?

I want to break the following string into sentences:

library(NLP) # NLP_0.1-7  
string <- as.String("Mr. Brown comes. He says hello. i give him coffee.")

I want to demonstrate two different ways. The first uses the openNLP package:

library(openNLP) # openNLP_0.2-5  

sentence_token_annotator <- Maxent_Sent_Token_Annotator(language = "en")  
boundaries_sentences <- annotate(string, sentence_token_annotator)  
string[boundaries_sentences]  

[1] "Mr. Brown comes."   "He says hello."     "i give him coffee."  

The second uses the stringi package:

library(stringi) # stringi_0.5-5  

stri_split_boundaries(string, opts_brkiter = stri_opts_brkiter('sentence'))

[[1]]  
 [1] "Mr. "                              "Brown comes. "                    
 [3] "He says hello. i give him coffee."

With this second approach I have to post-process the result to remove the extra spaces, or split the string into sentences all over again. Can I adjust the stringi call to improve the quality of the result?

With big data, openNLP is (very much) slower than stringi.
Is there a way to combine stringi (-> fast) and openNLP (-> quality)?

asked Aug 06 '15 by SRRussel


2 Answers

Text boundary (in this case, sentence boundary) analysis in ICU (and thus in stringi) is governed by the rules described in Unicode UAX #29; see also the ICU User Guide on the topic. There we read:

[The Unicode rules] cannot detect cases such as “...Mr. Jones...”; more sophisticated tailoring would be required to detect such cases.

In other words, this cannot be done without a custom dictionary of non-stop words, which in fact is implemented in openNLP. A few possible scenarios to incorporate stringi for performing this task would therefore include:

  1. Use stri_split_boundaries and then write a function that decides which incorrectly split tokens should be joined back together (see the sketch after this list).
  2. Manually insert non-breaking spaces into the text, e.g. after the dots that follow etc., Mr., i.e., and so on (note that this is in fact required when preparing documents in LaTeX -- otherwise you get too-large spaces between words).
  3. Incorporate a custom non-stop-word list into a regex and apply stri_split_regex.

and so on.
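For illustration, here is a minimal sketch of scenario 1; the helper join_abbrev_splits() and its abbreviation list are made up for this example and would have to be extended for real data:

library(stringi)

s <- "Mr. Brown comes. He says hello. i give him coffee."

# Split on ICU sentence boundaries, then glue a chunk back onto the previous
# one whenever that previous chunk ends in a known abbreviation; finally trim
# the trailing spaces mentioned in the question.
join_abbrev_splits <- function(chunks, abbrev = c("Mr", "Mrs", "Dr", "Prof", "etc")) {
  pattern <- stri_c("\\b(", stri_c(abbrev, collapse = "|"), ")\\.\\s*$")
  out <- character(0)
  for (chunk in chunks) {
    if (length(out) > 0 && stri_detect_regex(out[length(out)], pattern)) {
      out[length(out)] <- stri_c(out[length(out)], chunk)
    } else {
      out <- c(out, chunk)
    }
  }
  stri_trim_both(out)
}

chunks <- stri_split_boundaries(s, opts_brkiter = stri_opts_brkiter('sentence'))[[1]]
join_abbrev_splits(chunks)

This repairs the bogus split after "Mr.", but the lowercase "i give him coffee." stays attached to the preceding sentence, because ICU never split it there in the first place; that case still needs openNLP or a tailored regex such as the one in the other answer.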

answered by gagolews


This may be a viable regex solution:

string <- "Mr. Brown comes. He says hello. i give him coffee."
stringi::stri_split_regex(string, "(?<!\\w\\.\\w.)(?<![A-Z][a-z]\\.)(?<=\\.|\\?|\\!)\\s")

## [[1]]
## [1] "Mr. Brown comes."   "He says hello."     "i give him coffee."

It performs less well on strings like the following, where spaced single-letter abbreviations such as "p. m." end in a lowercase letter plus a dot and so are not covered by the lookbehinds:

string <- "Mr. Brown comes! He says hello. i give him coffee.  i will got at 5 p. m. eastern time.  Or somewhere in between"
answered by Tyler Rinker