I'd like to extract text between two strings for all occurrences of a pattern. For example, I have this string:
x <- "\nTYPE: School\nCITY: ATLANTA\n\n\nCITY: LAS VEGAS\n\n"
I'd like to extract the words ATLANTA and LAS VEGAS, as such:
[1] "ATLANTA" "LAS VEGAS"
I tried using gsub(".*CITY:\\s|\n","",x). The output this yields is:
[1] " LAS VEGAS"
I would like to output both cities (some patterns in the data include more than 2 cities) and to output them without the leading space.
I also tried the qdapRegex package but could not get close. I am not that good with regular expressions so help would be much appreciated.
You may use
> unlist(regmatches(x, gregexpr("CITY:\\s*\\K.*", x, perl=TRUE)))
[1] "ATLANTA" "LAS VEGAS"
Here, the CITY:\s*\K.* regex matches:
CITY: - a literal substring CITY:
\s* - 0+ whitespaces
\K - match reset operator that discards the text matched so far (zeros the current match memory buffer)
.* - any 0+ chars other than line break chars, as many as possible.
See the regex demo online.
Note that since it is a PCRE regex, perl=TRUE is indispensable.
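If the data is a character vector with several such strings, the same call can be wrapped in a small helper. This is only a sketch: the name extract_cities and the second sample string in y are made up for illustration and are not part of the question.

```r
# Hypothetical helper (the name is an assumption, not from the question):
# returns, for each element of x, every value that follows "CITY:" on its line.
extract_cities <- function(x) {
  # \K drops "CITY:" and the whitespace after it from the reported match;
  # perl = TRUE is needed because \K is PCRE syntax.
  m <- gregexpr("CITY:\\s*\\K.*", x, perl = TRUE)
  regmatches(x, m)
}

x <- "\nTYPE: School\nCITY: ATLANTA\n\n\nCITY: LAS VEGAS\n\n"
cities <- unlist(extract_cities(x))
print(cities)
# [1] "ATLANTA"   "LAS VEGAS"

# Made-up second element with three cities, to show per-element results.
y <- c(x, "CITY: RENO\nCITY: BOISE\nCITY: MIAMI")
per_string <- extract_cities(y)
print(lengths(per_string))
# [1] 2 3
```

Without unlist(), regmatches() returns a list with one character vector per input string, which keeps the cities grouped by the record they came from.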