I want to extract the first sentence from the following text with a regex. The rule I want to implement (which I know won't be a universal solution) is to extract from the string start ^
up to (and including) the first period/exclamation/question mark that is preceded by a lowercase letter or number.
require(stringr)
x = "Bali bombings: U.S. President George W. Bush amongst many others has condemned the perpetrators of the Bali car bombing of October 11. The death toll has now risen to at least 187."
My best guess so far has been a non-greedy string-before-match approach, which fails in this case:
str_extract(x, '.+?(?=[a-z0-9][.?!] )')
[1] NA
Any tips much appreciated.
You put the [a-z0-9][.?!] into a non-consuming lookahead pattern; you need to make it consuming if you plan to use str_extract:
> str_extract(x, '.*?[a-z0-9][.?!](?= )')
[1] "Bali bombings: U.S. President George W. Bush amongst many others has condemned the perpetrators of the Bali car bombing of October 11."
See this regex demo.
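To see the two patterns side by side, here is a minimal sketch on a made-up string (y is an illustrative input, not the text from the question). It only shows which characters end up inside the match in each case.

```r
library(stringr)

# Hypothetical toy input, chosen so that only one boundary qualifies
y <- "It rained on day 3. Then it stopped."

# Lookahead version: the digit and the period are only asserted, not matched,
# so the result stops right before them.
lookahead_only <- str_extract(y, ".+?(?=[a-z0-9][.?!] )")

# Consuming version: the digit and the period are part of the match;
# only the trailing space is left in the lookahead.
consuming <- str_extract(y, ".*?[a-z0-9][.?!](?= )")

print(lookahead_only)  # "It rained on day "
print(consuming)       # "It rained on day 3."
```

Because the (?= ) lookahead still requires a following space, a sentence that ends exactly at the end of the string would not be matched by this pattern.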
Details
.*? - any 0+ chars other than line break chars, as few as possible
[a-z0-9] - an ASCII lowercase letter or a digit
[.?!] - a ., ? or !
(?= ) - that is followed with a literal space.

Alternatively, you may use sub:
sub("([a-z0-9][?!.])\\s.*", "\\1", x)
See this regex demo.
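If you need this in more than one place, the sub call can be wrapped in a small helper. This is only a sketch: first_sentence is a hypothetical name, and the inputs below are made up for illustration.

```r
# Hypothetical helper wrapping the sub() call above; base R only.
first_sentence <- function(text) {
  # Keep everything up to the first [a-z0-9] + punctuation that is
  # followed by whitespace, and drop the rest of the string.
  sub("([a-z0-9][?!.])\\s.*", "\\1", text)
}

# sub() is vectorised, so several strings can be processed at once.
res <- first_sentence(c("It rained on day 3. Then it stopped.",
                        "No boundary here"))
print(res)
```

One difference worth keeping in mind: when nothing matches, sub returns the input unchanged, whereas str_extract returns NA.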
Details
([a-z0-9][?!.]) - Group 1 (referred to with \1 from the replacement pattern): an ASCII lowercase letter or digit and then a ?, ! or .
\s - a whitespace
.* - any 0+ chars, as many as possible (up to the end of string).

corpus has special handling for abbreviations when determining sentence boundaries:
library(corpus)
text_split(x, "sentences")
#>   parent index text
#> 1 1          1 Bali bombings: U.S. President George W. Bush amongst many others
#>                has condemned the perpetrators of the Bali car bombing of Oct…
#> 2 1          2 The death toll has now risen to at least 187.
There's also a useful dataset of common abbreviations for many languages, including English. See corpus::abbreviations_en, which can be used for disambiguating sentence boundaries.
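To illustrate the general idea of using an abbreviation list to disambiguate boundaries, here is a hand-rolled base R sketch. It is not how corpus works internally; first_sentence_abbrev and the short abbrevs vector are hypothetical stand-ins (in practice you could try passing corpus::abbreviations_en instead).

```r
# Hypothetical, deliberately tiny abbreviation list for illustration only
abbrevs <- c("U.S.", "W.", "Mr.", "Dr.")

first_sentence_abbrev <- function(text, abbrevs) {
  # Candidate boundaries: ., ? or ! immediately followed by whitespace
  ends <- gregexpr("[.?!](?=\\s)", text, perl = TRUE)[[1]]
  for (end in ends) {
    if (end < 0) break                         # gregexpr found no candidate
    candidate <- substr(text, 1, end)
    last_token <- sub(".*\\s", "", candidate)  # last whitespace-separated token
    # Accept the boundary only if the token before it is not a known abbreviation
    if (!(last_token %in% abbrevs)) return(candidate)
  }
  text  # no acceptable boundary: return the input unchanged
}

x <- "Bali bombings: U.S. President George W. Bush amongst many others has condemned the perpetrators of the Bali car bombing of October 11. The death toll has now risen to at least 187."
res <- first_sentence_abbrev(x, abbrevs)
print(res)
```

Unlike the [a-z0-9] rule, this approach does not depend on what character precedes the punctuation, but it is only as good as the abbreviation list it is given.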