This is my sample text:
text = "First sentence. This is a second sentence. I like pets e.g. cats or birds."
I have a function which splits texts by sentence
library(stringi)
split_by_sentence <- function(text) {
  # split based on periods, exclams or question marks
  result <- unlist(strsplit(text, "\\.\\s|\\?|!"))
  result <- stri_trim_both(result)
  result <- result[nchar(result) > 0]
  if (length(result) == 0)
    result <- ""
  return(result)
}
which actually splits by punctuation characters. This is the output:
> split_by_sentence(text)
[1] "First sentence" "This is a second sentence" "I like pets e.g" "cats or birds."
Is there a possibility to exclude special patterns like "e.g."?
In your pattern, you can specify that you want to split at any punctuation mark that is followed by a space, but only if at least 3 alphanumeric characters come directly before it (using a look-behind). This results in:
unlist(strsplit(text, "(?<=[[:alnum:]]{3})[?!.]\\s", perl=TRUE))
#[1] "First sentence" "This is a second sentence" "I like pets e.g. cats or birds."
If you want to keep the punctuation marks, then you can add the pattern inside the look-behind and only split on space:
unlist(strsplit(text, "(?<=[[:alnum:]]{3}[?!.])\\s", perl=TRUE))
# [1] "First sentence." "This is a second sentence." "I like pets e.g. cats or birds."
text2 <- "I like pets (cats and birds) and horses. I have 1.8 bn. horses."
unlist(strsplit(text2, "(?<=[[:alnum:]]{3}[?!.])\\s", perl=TRUE))
#[1] "I like pets (cats and birds) and horses." "I have 1.8 bn. horses."
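To tie this back to the question, here is a minimal sketch of split_by_sentence rewritten around the look-behind pattern. It assumes the punctuation should be kept, and it swaps stri_trim_both for base R's trimws so that the snippet runs without stringi. That substitution is my own choice, not part of the original function.

```r
# Sketch: split_by_sentence using the look-behind pattern from above.
split_by_sentence <- function(text) {
  # Split on a whitespace character that follows three alphanumerics
  # and one of ., ? or ! (the punctuation itself is kept).
  result <- unlist(strsplit(text, "(?<=[[:alnum:]]{3}[?!.])\\s", perl = TRUE))
  result <- trimws(result)             # assumption: base R stand-in for stri_trim_both
  result <- result[nchar(result) > 0]  # drop empty pieces
  if (length(result) == 0)
    result <- ""
  result
}

text <- "First sentence. This is a second sentence. I like pets e.g. cats or birds."
sentences <- split_by_sentence(text)
print(sentences)
```

Note that the three-character threshold is a heuristic: a short final word before a period (e.g. "It is. Next one.") would not trigger a split, so adjust {3} to suit your data.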
N.B.: If there may be more than one space after the punctuation mark, you can put \\s+ instead of \\s in the pattern.
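A short illustration of that note, using a made-up sample string (text3 is hypothetical, not from the question) with two spaces after the first sentence:

```r
# Hypothetical sample: two spaces follow the first sentence.
text3 <- "First sentence.  Second sentence! I like pets e.g. cats."

# \\s+ consumes the whole run of whitespace after the punctuation mark,
# so the next sentence does not start with a stray space.
parts <- unlist(strsplit(text3, "(?<=[[:alnum:]]{3}[?!.])\\s+", perl = TRUE))
print(parts)
```

With plain \\s, only the first of the two spaces would be removed, and the second piece would begin with a leading space.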
library(tokenizers)
text = "First sentence. This is a second sentence. I like pets e.g. cats or birds."
tokenize_sentences(text)
Output is:
[[1]]
[1] "First sentence." "This is a second sentence." "I like pets e.g. cats or birds."