I am trying to find a simple way to extract an unknown substring (could be anything) that appears between two known substrings. For example, I have a string:
a<-" anything goes here, STR1 GET_ME STR2, anything goes here"
I need to extract the string GET_ME, which is between STR1 and STR2 (without the white spaces).
I am trying str_extract(a, "STR1 (.+) STR2"), but I am getting the entire match:
[1] "STR1 GET_ME STR2"
I can of course strip the known strings, to isolate the substring I need, but I think there should be a cleaner way to do it by using a correct regular expression.
Extracting a string between two characters in R: find the position of the initial character and add 1 to it; that is the initial position of the desired string. Find the position of the final character and subtract 1 from it; that is the final position of the desired string. Then use the substr() function to extract the desired string inclusively between the initial and final positions found in Steps 1-2.
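The position-based steps above can be sketched in base R as follows. This is only an illustration applied to the string from the question; it assumes each marker occurs once and that STR1 comes before STR2, and the variable names are made up for the example.

```r
a <- " anything goes here, STR1 GET_ME STR2, anything goes here"

# Step 1: the wanted text starts right after the first marker.
first <- as.integer(regexpr("STR1", a, fixed = TRUE)) + nchar("STR1")

# Step 2: the wanted text ends one character before the second marker.
last <- as.integer(regexpr("STR2", a, fixed = TRUE)) - 1L

# Step 3: extract inclusively between the two positions,
# then drop the spaces that surround the value.
between <- trimws(substr(a, first, last))
between
# [1] "GET_ME"
```

Note that regexpr() returns -1 when a marker is absent, so real code would want to check for that before calling substr().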
The substr() and substring() functions in R can be used either to extract parts of character strings or to change the values of parts of character strings. Both work on a character vector, including a column of a data frame.
To extract words from a string, we can use the word() function of the stringr package. For example, if x is a string that contains 100 words, then the first 20 words can be extracted with the command word(x, start = 1, end = 20, sep = fixed(" ")).
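A small sketch of word() on a made-up sentence (the input string is an assumption chosen for illustration; it requires the stringr package to be installed):

```r
library(stringr)

x <- "the quick brown fox jumps over the lazy dog"

# First three words, splitting on single spaces.
first_three <- word(x, start = 1, end = 3, sep = fixed(" "))
first_three
# [1] "the quick brown"

# Negative positions count from the end, so -1 is the last word.
last_word <- word(x, start = -1)
last_word
# [1] "dog"
```

This only helps when the value you want sits at a known word position; for text between two known markers, a regular expression is usually the better fit.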
The stringr package provides a cohesive set of functions designed to make working with strings as easy as possible. If you're not familiar with strings, the best place to start is the chapter on strings in R for Data Science.
You may use str_match with STR1 (.*?) STR2. Note that the spaces are "meaningful": if you want to just match anything in between STR1 and STR2, use STR1(.*?)STR2, or use STR1\\s*(.*?)\\s*STR2 to trim the value you need. If you have multiple occurrences, use str_match_all.
Also, if you need to match strings that span across line breaks/newlines, add (?s) at the start of the pattern: (?s)STR1(.*?)STR2 or (?s)STR1\\s*(.*?)\\s*STR2.
library(stringr)
a <- " anything goes here, STR1 GET_ME STR2, anything goes here"
res <- str_match(a, "STR1\\s*(.*?)\\s*STR2")
res[,2]
[1] "GET_ME"
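The two variations mentioned in this answer, multiple occurrences and matches that span newlines, might look like the sketch below. The input strings b and multi are made up for the example.

```r
library(stringr)

b <- "x STR1 GET_ME STR2 y STR1 GET_ME2 STR2 z"
pattern <- "STR1\\s*(.*?)\\s*STR2"

# str_match_all() returns a list with one matrix per input string;
# column 1 holds the full match, column 2 the first capture group.
all_matches <- str_match_all(b, pattern)[[1]][, 2]
all_matches
# [1] "GET_ME"  "GET_ME2"

# With (?s) the dot also matches newlines, so the capture can span lines.
multi <- "STR1 line one\nline two STR2"
spanning <- str_match(multi, "(?s)STR1\\s*(.*?)\\s*STR2")[, 2]
spanning
# [1] "line one\nline two"
```

Without the (?s) flag, the second call would return NA, because the dot stops at the line break.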
Another way, using base R regexec (to get the first match):
test <- " anything goes here, STR1 GET_ME STR2, anything goes here STR1 GET_ME2 STR2"
pattern <- "STR1\\s*(.*?)\\s*STR2"
result <- regmatches(test, regexec(pattern, test))
result[[1]][2]
[1] "GET_ME"
Here's another way using base R:
a <- " anything goes here, STR1 GET_ME STR2, anything goes here"
gsub(".*STR1 (.+) STR2.*", "\\1", a)
Output:
[1] "GET_ME"
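One caveat with the gsub approach: when the pattern does not match, gsub returns the input unchanged rather than a missing value. A possible guard is sketched below; extract_between is a hypothetical helper, not something from the answers, and it assumes the two markers contain no regex metacharacters.

```r
# Without a match, gsub() simply echoes the input back.
unmatched <- gsub(".*STR1 (.+) STR2.*", "\\1", "no markers in this string")
unmatched
# [1] "no markers in this string"

# Hypothetical helper: return NA when the markers are not found.
# left and right are inserted into the regex as-is (an assumption).
extract_between <- function(x, left = "STR1", right = "STR2") {
  pattern <- paste0(".*", left, "\\s*(.+?)\\s*", right, ".*")
  found <- grepl(pattern, x, perl = TRUE)
  ifelse(found, sub(pattern, "\\1", x, perl = TRUE), NA_character_)
}

hit <- extract_between(" anything goes here, STR1 GET_ME STR2, anything goes here")
miss <- extract_between("no markers in this string")
hit
# [1] "GET_ME"
miss
# [1] NA
```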