I want to break the following string into sentences:
library(NLP) # NLP_0.1-7
string <- as.String("Mr. Brown comes. He says hello. i give him coffee.")
I want to demonstrate two different ways. The first comes from the openNLP package:
library(openNLP) # openNLP_0.2-5
sentence_token_annotator <- Maxent_Sent_Token_Annotator(language = "en")
boundaries_sentences <- annotate(string, sentence_token_annotator)
string[boundaries_sentences]
[1] "Mr. Brown comes." "He says hello." "i give him coffee."
The second comes from the stringi package:
library(stringi) # stringi_0.5-5
stri_split_boundaries(string, opts_brkiter = stri_opts_brkiter('sentence'))
[[1]]
[1] "Mr. " "Brown comes. "
[3] "He says hello. i give him coffee."
With this second approach I have to post-process the result to remove extra spaces, or break some of the returned strings into sentences again. Can I adjust the stringi function to improve the quality of the result?
When it comes to big data, openNLP is (very much) slower than stringi.
Is there a way to combine stringi (fast) and openNLP (quality)?
Text boundary (in this case, sentence boundary) analysis in ICU (and thus in stringi) is governed by the rules described in Unicode UAX29, see also ICU Users Guide on the topic. We read:
[The Unicode rules] cannot detect cases such as “...Mr. Jones...”; more sophisticated tailoring would be required to detect such cases.
In other words, this cannot be done without a custom dictionary of non-stop words, which in fact is implemented in openNLP. A few possible scenarios to incorporate stringi for performing this task would therefore include: using stri_split_boundaries and then writing a function that decides which incorrectly split tokens should be joined; using stri_split_regex; and so on.
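The first scenario could be sketched roughly as follows. This is only an illustration, not a tested solution: the abbreviation list and the helper names join_abbreviation_splits and split_sentences are made up for this example, and the list would need to be extended for a real corpus.

```r
# Hypothetical list of abbreviations after which a period does not end a sentence.
abbreviations <- c("Mr.", "Mrs.", "Ms.", "Dr.", "Prof.")

# Re-join pieces that were split right after a known abbreviation.
join_abbreviation_splits <- function(pieces, abbr = abbreviations) {
  pieces <- trimws(pieces)            # drop the trailing spaces ICU keeps
  pieces <- pieces[nzchar(pieces)]    # drop empty pieces
  out <- character(0)
  carry <- ""
  for (p in pieces) {
    current <- if (nzchar(carry)) paste(carry, p) else p
    last_word <- sub(".*\\s", "", current)  # last whitespace-delimited token
    if (last_word %in% abbr) {
      carry <- current                # false boundary: keep accumulating
    } else {
      out <- c(out, current)
      carry <- ""
    }
  }
  if (nzchar(carry)) out <- c(out, carry)
  out
}

# Split with stringi (fast), then repair the false boundaries.
split_sentences <- function(text, abbr = abbreviations) {
  pieces <- stringi::stri_split_boundaries(
    text, opts_brkiter = stringi::stri_opts_brkiter("sentence")
  )[[1]]
  join_abbreviation_splits(pieces, abbr)
}
```

Note that this only repairs false splits. A boundary that ICU misses altogether, such as the one before the lowercase "i give him coffee." in the question, is not recovered by it.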
This may be a viable regex solution:
string <- "Mr. Brown comes. He says hello. i give him coffee."
stringi::stri_split_regex(string, "(?<!\\w\\.\\w.)(?<![A-Z][a-z]\\.)(?<=\\.|\\?|\\!)\\s")
## [[1]]
## [1] "Mr. Brown comes." "He says hello." "i give him coffee."
It performs less well on:
string <- "Mr. Brown comes! He says hello. i give him coffee. i will got at 5 p. m. eastern time. Or somewhere in between"
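One possible tweak for this case, offered as a sketch rather than a general fix, is an extra negative lookbehind that refuses to split after a single lowercase letter followed by a period (such as "p." or "m."). The name sentence_pattern is only for illustration, and the added lookbehind is an assumption that goes beyond the regex above.

```r
string <- "Mr. Brown comes! He says hello. i give him coffee. i will got at 5 p. m. eastern time. Or somewhere in between"

# Same pattern as above, plus (?<!\\b[a-z]\\.), which blocks a split
# after a stand-alone lowercase letter followed by a period.
sentence_pattern <- paste0(
  "(?<!\\w\\.\\w.)",      # not after things like "e.g."
  "(?<![A-Z][a-z]\\.)",   # not after "Mr.", "Dr.", ...
  "(?<!\\b[a-z]\\.)",     # not after single letters such as "p." or "m."
  "(?<=\\.|\\?|\\!)\\s"   # whitespace preceded by ., ? or !
)

sentences <- stringi::stri_split_regex(string, sentence_pattern)[[1]]
```

With this pattern, "i will got at 5 p. m. eastern time." should stay in one piece. The trade-off is that a sentence that really does end in a single lowercase letter will no longer be split off.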