 

Extract first sentence in string

Tags: regex, r, stringr

I want to extract the first sentence from the following string with a regex. The rule I want to implement (which I know won't be a universal solution) is to extract from the string start ^ up to and including the first period, exclamation mark or question mark that is preceded by a lowercase letter or a digit.

library(stringr)

x = "Bali bombings: U.S. President George W. Bush amongst many others has condemned the perpetrators of the Bali car bombing of October 11. The death toll has now risen to at least 187."

My best guess so far has been a non-greedy string-before-match approach, which fails in this case:

str_extract(x, '.+?(?=[a-z0-9][.?!] )')
[1] NA

Any tips much appreciated.

geotheory asked Feb 20 '18 12:02



2 Answers

You put [a-z0-9][.?!] inside a non-consuming lookahead, so it is excluded from the match. To use str_extract, make that part consuming and keep only the space in the lookahead:

str_extract(x, '.*?[a-z0-9][.?!](?= )')
[1] "Bali bombings: U.S. President George W. Bush amongst many others has condemned the perpetrators of the Bali car bombing of October 11."

Details

  • .*? - any 0+ chars other than line break chars, as few as possible
  • [a-z0-9] - an ASCII lowercase letter or a digit
  • [.?!] - a ., ? or !
  • (?= ) - a positive lookahead that requires a literal space right after, without adding it to the match.
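
One caveat with the pattern above: because it requires a space after the punctuation, a string that consists of a single sentence produces no match. The following is a minimal sketch in base R (regexpr with perl = TRUE, which supports lookaheads), assuming you also want the end of the string to count as a boundary. The helper name first_sentence, the ^ anchor and the |$ alternative are my additions, not part of the pattern above.

```r
# Hypothetical helper: first sentence under the "lowercase letter or digit
# before . ? !" rule, accepting either a space or the end of the string
# as the boundary after the punctuation mark.
first_sentence <- function(s) {
  m <- regexpr("^.*?[a-z0-9][.?!](?= |$)", s, perl = TRUE)
  if (m[[1]] == -1) return(NA_character_)  # no sentence boundary found
  regmatches(s, m)
}

x <- "Bali bombings: U.S. President George W. Bush amongst many others has condemned the perpetrators of the Bali car bombing of October 11. The death toll has now risen to at least 187."

first_sentence(x)                  # stops after "October 11."
first_sentence("It works!")        # matched thanks to the |$ alternative
first_sentence("no punctuation")   # NA
```

Without the |$ alternative, the second call would return NA as well.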

Alternatively, you may use sub:

sub("([a-z0-9][?!.])\\s.*", "\\1", x)

Details

  • ([a-z0-9][?!.]) - Group 1 (referred to with \1 from the replacement pattern): an ASCII lowercase letter or digit and then a ?, ! or .
  • \s - a whitespace character
  • .* - any 0+ chars, as many as possible (up to the end of string).
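
The two approaches differ when nothing matches: str_extract returns NA, whereas sub returns its input unchanged. A short sketch of that behaviour, with hypothetical helper names (first_by_sub, has_boundary) introduced only for illustration:

```r
# The pattern from the sub() call above, wrapped in hypothetical helpers.
pattern <- "([a-z0-9][?!.])\\s.*"

first_by_sub <- function(s) sub(pattern, "\\1", s)

# TRUE when the string contains a sentence boundary the pattern can cut at
has_boundary <- function(s) grepl(pattern, s)

first_by_sub("Is this the end? Yes it is.")    # "Is this the end?"
first_by_sub("Only one sentence.")             # returned unchanged
first_by_sub("no terminal punctuation here")   # also returned unchanged

has_boundary("Is this the end? Yes it is.")    # TRUE
has_boundary("no terminal punctuation here")   # FALSE
```

Returning the input unchanged is convenient for single-sentence strings, but it also means a string with no boundary at all comes back whole, so check with grepl first if you need to tell the two cases apart.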
Wiktor Stribiżew answered Sep 29 '22 11:09


The corpus package has special handling for abbreviations when determining sentence boundaries:

library(corpus)
text_split(x, "sentences")
#>   parent index text
#> 1 1          1 Bali bombings: U.S. President George W. Bush amongst many others has condemned the perpetrators of the Bali car bombing of Oct…
#> 2 1          2 The death toll has now risen to at least 187.

There's also a useful dataset of common abbreviations for many languages, including English. See corpus::abbreviations_en, which can be used for disambiguating sentence boundaries.
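
To illustrate how an abbreviation list can disambiguate boundaries, here is a rough base R sketch. It is not how corpus works internally. The abbrevs vector is a tiny hand-picked stand-in for a real list such as corpus::abbreviations_en, and first_sentence_abbrev is a hypothetical helper name.

```r
# Hypothetical, hand-picked abbreviation list for illustration only.
abbrevs <- c("U.S.", "W.", "Mr.", "Dr.")

first_sentence_abbrev <- function(s, abbrevs) {
  # Every ., ? or ! that is followed by a space or ends the string
  ends <- gregexpr("[.?!](?= |$)", s, perl = TRUE)[[1]]
  if (ends[1] == -1) return(NA_character_)
  for (e in ends) {
    candidate <- substr(s, 1, e)
    # Last whitespace-delimited token of the candidate sentence
    last_token <- sub(".*\\s", "", candidate)
    # Accept the boundary unless that token is a known abbreviation
    if (!(last_token %in% abbrevs)) return(candidate)
  }
  NA_character_
}

x <- "Bali bombings: U.S. President George W. Bush amongst many others has condemned the perpetrators of the Bali car bombing of October 11. The death toll has now risen to at least 187."

first_sentence_abbrev(x, abbrevs)
first_sentence_abbrev("Mr. Smith arrived at 5 PM. He left.", abbrevs)
```

Unlike the lowercase-or-digit rule, this variant can end a sentence on an uppercase token such as "PM.", at the cost of depending on how complete the abbreviation list is.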

dmi3kno answered Sep 29 '22 09:09