
How do I extract text between two characters in R?

Tags: string, regex, r

I'd like to extract text between two strings for all occurrences of a pattern. For example, I have this string:

x <- "\nTYPE:    School\nCITY:   ATLANTA\n\n\nCITY:   LAS VEGAS\n\n"

I'd like to extract the words ATLANTA and LAS VEGAS as such:

[1] "ATLANTA"   "LAS VEGAS"

I tried using gsub(".*CITY:\\s|\n","",x). The output this yields is:

[1] "  LAS VEGAS"

I would like to output every city (some patterns in the data include more than 2 cities), without the leading spaces.
I also tried the qdapRegex package but could not get close. I am not that good with regular expressions, so help would be much appreciated.

Cordy asked Mar 05 '23 18:03


1 Answer

You may use

> unlist(regmatches(x, gregexpr("CITY:\\s*\\K.*", x, perl=TRUE)))
[1] "ATLANTA"   "LAS VEGAS"

Here, the regex CITY:\s*\K.* matches:

  • CITY: - a literal substring CITY:
  • \s* - zero or more whitespace characters
  • \K - match reset operator that discards the text matched so far (zeros the current match memory buffer)
  • .* - any 0+ chars other than line break chars, as many as possible.
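
To make the role of \K concrete, here is a small sketch that runs the same pattern with and without it on the string from the question. The variable names with_k and without_k are only illustrative and do not come from the original answer.

```r
x <- "\nTYPE:    School\nCITY:   ATLANTA\n\n\nCITY:   LAS VEGAS\n\n"

# Without \K the whole match is returned, label and spaces included.
without_k <- unlist(regmatches(x, gregexpr("CITY:\\s*.*", x, perl = TRUE)))

# With \K everything matched before it is dropped from the result,
# so only the text after the label and whitespace remains.
with_k <- unlist(regmatches(x, gregexpr("CITY:\\s*\\K.*", x, perl = TRUE)))

print(without_k)
print(with_k)
```

The first call should print the matches with the "CITY:" prefix still attached, while the second prints only the city names.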

See the regex demo online.

Note that since \K is PCRE syntax, perl=TRUE is required.
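
If the same extraction is needed for other labels, the call can be wrapped in a small function. This is only a sketch: extract_after_label is a hypothetical helper name, and it assumes the label contains no regex metacharacters and that the value sits on the same line as the label.

```r
# Hypothetical helper (not part of the original answer): returns every
# value that follows `label` on the same line.
extract_after_label <- function(x, label = "CITY:") {
  # \\K drops the label and the whitespace after it from the reported
  # match; it is PCRE syntax, so perl = TRUE is needed.
  pattern <- paste0(label, "\\s*\\K.*")
  unlist(regmatches(x, gregexpr(pattern, x, perl = TRUE)))
}

x <- "\nTYPE:    School\nCITY:   ATLANTA\n\n\nCITY:   LAS VEGAS\n\n"
cities <- extract_after_label(x)           # values after "CITY:"
types  <- extract_after_label(x, "TYPE:")  # values after "TYPE:"
print(cities)
print(types)
```

Because \s also matches line breaks, a label with an empty value would pick up the next non-empty line instead; replacing \\s* with [ \\t]* in the pattern is one way to avoid that.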

Wiktor Stribiżew answered Mar 16 '23 12:03