Extract first sentence in string

Tags: regex, r, stringr

I want to extract the first sentence from the following string with a regex. The rule I want to implement (which I know won't be a universal solution) is to extract from the string start ^ up to (and including) the first period, exclamation mark, or question mark that is preceded by a lowercase letter or a number.

require(stringr)

x = "Bali bombings: U.S. President George W. Bush amongst many others has condemned the perpetrators of the Bali car bombing of October 11. The death toll has now risen to at least 187."

My best guess so far has been a non-greedy string-before-match approach, but it fails in this case:

str_extract(x, '.+?(?=[a-z0-9][.?!] )')
[1] NA

Any tips much appreciated.

asked Feb 20 '18 12:02

geotheory

2 Answers

You put the [a-z0-9][.?!] part into a non-consuming lookahead, so it is checked but never becomes part of the match. You need to make it consuming if you plan to use str_extract:

> str_extract(x, '.*?[a-z0-9][.?!](?= )')
[1] "Bali bombings: U.S. President George W. Bush amongst many others has condemned the perpetrators of the Bali car bombing of October 11."

Details

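.*? - any 0+ chars other than line break chars, as few as possible
[a-z0-9] - an ASCII lowercase letter or a digit
[.?!] - a ., ? or !
(?= ) - a positive lookahead that requires a literal space immediately after the current position, without adding it to the match.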
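
To make the difference between the consuming and the lookahead-only version concrete, here is a minimal sketch on a few made-up inputs. The helper name first_sentence and the test strings are only illustrative assumptions, not part of the question.

```r
library(stringr)

# Hypothetical helper wrapping the consuming pattern from this answer.
first_sentence <- function(s) {
  str_extract(s, ".*?[a-z0-9][.?!](?= )")
}

# The punctuation is consumed, so it is part of the result.
first_sentence("Is it over? Not yet.")
#> [1] "Is it over?"

# With everything inside the lookahead, the match stops before
# the lowercase letter and the punctuation.
str_extract("Is it over? Not yet.", ".+?(?=[a-z0-9][.?!] )")
#> [1] "Is it ove"

# The lookahead needs a following space, so a lone sentence gives NA.
first_sentence("Only one sentence.")
#> [1] NA

# Abbreviations ending in a lowercase letter still split too early.
first_sentence("Mr. Smith arrived. He sat.")
#> [1] "Mr."
```

The last two calls show the limits of the rule: a string with no space after the final punctuation is not matched at all, and abbreviations such as "Mr." are treated as sentence ends.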

Alternatively, you may use sub:

sub("([a-z0-9][?!.])\\s.*", "\\1", x)

Details

([a-z0-9][?!.]) - Group 1 (referred to with \1 from the replacement pattern): an ASCII lowercase letter or digit and then a ?, ! or .
\s - a whitespace char
.* - any 0+ chars, as many as possible (up to the end of string).
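
One practical difference from str_extract is what happens when nothing matches: sub returns its input unchanged instead of NA. A small sketch, assuming a hypothetical wrapper name first_sentence_sub and made-up inputs:

```r
# Hypothetical wrapper around the sub() call from this answer (base R only).
first_sentence_sub <- function(s) {
  sub("([a-z0-9][?!.])\\s.*", "\\1", s)
}

# Everything from the first whitespace after the sentence end is dropped.
first_sentence_sub("It rained today. We stayed in.")
#> [1] "It rained today."

# No whitespace follows the final period, so there is no match
# and the string comes back as it was.
first_sentence_sub("Only one sentence.")
#> [1] "Only one sentence."
```

Whether returning the whole string is preferable to NA depends on how single-sentence inputs should be treated downstream.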

answered Sep 29 '22 11:09

Wiktor Stribiżew

The corpus package has special handling for abbreviations when determining sentence boundaries:

library(corpus)
text_split(x, "sentences")
#>   parent index text
#> 1 1          1 Bali bombings: U.S. President George W. Bush amongst many others has condemned the perpetrators of the Bali car bombing of Oct…
#> 2 1          2 The death toll has now risen to at least 187.

There's also a useful dataset of common abbreviations for many languages, including English. See corpus::abbreviations_en, which can be used for disambiguating sentence boundaries.

answered Sep 29 '22 09:09

dmi3kno
