
Extracting a string between other two strings in R

Tags: regex, r, stringr

I am trying to find a simple way to extract an unknown substring (it could be anything) that appears between two known substrings. For example, I have a string:

a<-" anything goes here, STR1 GET_ME STR2, anything goes here"

I need to extract the string GET_ME which is between STR1 and STR2 (without the white spaces).

I am trying str_extract(a, "STR1 (.+) STR2"), but I am getting the entire match

[1] "STR1 GET_ME STR2" 

I can of course strip the known strings to isolate the substring I need, but I think there should be a cleaner way to do it with the right regular expression.

Sasha asked Aug 22 '16 18:08



2 Answers

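str_extract returns only the whole match, so use str_match instead: it also returns the capture groups. With the pattern STR1 (.*?) STR2, the spaces are meaningful, so they must be present in the input. To capture everything between STR1 and STR2, including any surrounding whitespace, use STR1(.*?)STR2; to trim the whitespace from the captured value, use STR1\\s*(.*?)\\s*STR2. If there are multiple occurrences, use str_match_all.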

Also, if the text to capture can span line breaks, add (?s) at the start of the pattern so that . also matches newlines: (?s)STR1(.*?)STR2 or (?s)STR1\\s*(.*?)\\s*STR2.

library(stringr)
a <- " anything goes here, STR1 GET_ME STR2, anything goes here"
res <- str_match(a, "STR1\\s*(.*?)\\s*STR2")
res[,2]

[1] "GET_ME"
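For the multiple-occurrence and multi-line cases mentioned above, here is a minimal sketch, assuming stringr is installed. The input string `x` and the variable names are made up for illustration; only `str_match_all` and the `(?s)` flag come from the answer.

```r
library(stringr)

# Hypothetical input: two STR1 ... STR2 pairs, the second spanning a line break.
x <- "STR1 first STR2 noise STR1 second\nline STR2"

# str_match_all() returns a list with one character matrix per input string;
# column 1 holds the full match, column 2 the first capture group.
m <- str_match_all(x, "(?s)STR1\\s*(.*?)\\s*STR2")
values <- m[[1]][, 2]
values
# [1] "first"        "second\nline"

# Without (?s), "." does not cross the newline, so only the first pair matches.
nrow(str_match_all(x, "STR1\\s*(.*?)\\s*STR2")[[1]])
# [1] 1
```

Indexing with `[[1]]` picks the matrix for the first (here, only) input string; for a longer vector, something like `lapply(m, function(mat) mat[, 2])` would collect the captures per element.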

Another way using base R regexec (to get the first match):

test <- " anything goes here, STR1 GET_ME STR2, anything goes here STR1 GET_ME2 STR2"
pattern <- "STR1\\s*(.*?)\\s*STR2"
result <- regmatches(test, regexec(pattern, test))
result[[1]][2]

[1] "GET_ME"
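regexec only reports the first match. If every occurrence is needed without extra packages, one possible approach (a sketch, not part of the original answer) is to collect the full matches with gregexpr and then reduce each one to its capture group with sub. The names `full` and `values` are illustrative.

```r
test <- " anything goes here, STR1 GET_ME STR2, anything goes here STR1 GET_ME2 STR2"
pattern <- "STR1\\s*(.*?)\\s*STR2"

# Step 1: every full match, e.g. "STR1 GET_ME STR2".
full <- regmatches(test, gregexpr(pattern, test, perl = TRUE))[[1]]

# Step 2: replace each full match with its first capture group.
values <- sub(pattern, "\\1", full, perl = TRUE)
values
# [1] "GET_ME"  "GET_ME2"
```

Running the pattern twice is slightly redundant, but it keeps the code to two base R calls and works on older R versions.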
Wiktor Stribiżew answered Sep 21 '22 14:09


Here's another way, using base R:

a <- " anything goes here, STR1 GET_ME STR2, anything goes here"
gsub(".*STR1 (.+) STR2.*", "\\1", a)

Output:

[1] "GET_ME" 
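One caveat with the substitution approach: when the pattern does not match, sub and gsub return the input unchanged, which can be mistaken for an extracted value. A small guard can return NA instead. The helper name `extract_between` is hypothetical and the sketch assumes the same fixed markers as above.

```r
a <- " anything goes here, STR1 GET_ME STR2, anything goes here"

# Hypothetical helper: NA when the markers are absent, the captured text otherwise.
extract_between <- function(x) {
  pattern <- ".*STR1 (.+) STR2.*"
  ifelse(grepl(pattern, x), sub(pattern, "\\1", x), NA_character_)
}

extract_between(c(a, "no markers here"))
# [1] "GET_ME" NA
```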
Ulises Rosas-Puchuri answered Sep 21 '22 14:09