Extracting a string between other two strings in R

Q: How do I extract text from two characters in R?

To extract a string between two characters in R: (1) find the position of the initial character and add 1 to it; that is the initial position of the desired string. (2) Find the position of the final character and subtract 1 from it; that is the final position of the desired string. (3) Use the substr() function to extract the desired string inclusively between the initial and final positions found in steps 1-2.
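The position-based approach described above can be sketched in base R as follows. The sample string and the bracket delimiters are assumptions chosen purely for illustration; regexpr() is used here to locate each delimiter, which is one of several ways to find a character position.

```r
# Hypothetical sample: extract the text between "[" and "]"
x <- "alpha[GET_ME]omega"

# fixed = TRUE treats the delimiters as literal characters, not regex syntax
start_pos <- as.integer(regexpr("[", x, fixed = TRUE)) + 1L
end_pos   <- as.integer(regexpr("]", x, fixed = TRUE)) - 1L

# substr() extracts inclusively between the two positions
between <- substr(x, start_pos, end_pos)
print(between)
```

Note that regexpr() returns -1 when a delimiter is absent, so a real script would probably want to check for that before calling substr().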

Q: How do I extract part of a string in R?

The substr() and substring() functions in R can be used either to extract parts of character strings or to replace parts of character strings. Both are vectorized, so they also work on a character vector or a data frame column.
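A small sketch of both uses, extraction and replacement, is shown here. The sample values are made up for illustration.

```r
s <- "abcdefgh"

part      <- substr(s, 3, 5)   # characters 3 through 5
tail_part <- substring(s, 6)   # from character 6 to the end of the string

# Both functions are vectorized over character vectors (or columns)
prefixes <- substr(c("apple", "banana"), 1, 3)

# The replacement form overwrites part of the string in place
substr(s, 1, 2) <- "XY"

print(part)
print(tail_part)
print(prefixes)
print(s)
```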

Q: How would you extract one particular word from a string in R?

To extract words from a string vector, we can use the word() function of the stringr package. For example, if x is a string that contains 100 words, the first 20 words can be extracted with word(x, start=1, end=20, sep=fixed(" ")).
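As a rough illustration of word(), the following sketch uses a short made-up sentence instead of a 100-word string; it assumes the stringr package is installed.

```r
library(stringr)

# Hypothetical sample sentence
x <- "the quick brown fox jumps over the lazy dog"

# Words 1 through 3, splitting on a literal space
first_three <- word(x, start = 1, end = 3, sep = fixed(" "))

# Negative positions count from the end of the string
last_word <- word(x, -1)

print(first_three)
print(last_word)
```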

Q: What is Stringr in R?

The stringr package provides a cohesive set of functions designed to make working with strings as easy as possible. If you're not familiar with strings, the best place to start is the chapter on strings in R for Data Science.

Tags: regex, r, stringr

I am trying to find a simple way to extract an unknown substring (could be anything) that appears between two known substrings. For example, I have a string:

a<-" anything goes here, STR1 GET_ME STR2, anything goes here"

I need to extract the string GET_ME which is between STR1 and STR2 (without the white spaces).

I am trying str_extract(a, "STR1 (.+) STR2"), but I am getting the entire match:

[1] "STR1 GET_ME STR2"

I can of course strip the known strings, to isolate the substring I need, but I think there should be a cleaner way to do it by using a correct regular expression.


asked Aug 22 '16 18:08

Sasha

2 Answers

You can use str_match with the pattern STR1 (.*?) STR2. Note that the spaces in the pattern are meaningful: to match anything between STR1 and STR2, use STR1(.*?)STR2, or use STR1\\s*(.*?)\\s*STR2 to trim surrounding whitespace from the captured value. If there are multiple occurrences, use str_match_all.

Also, if the text to match may span line breaks, add (?s) at the start of the pattern: (?s)STR1(.*?)STR2 or (?s)STR1\\s*(.*?)\\s*STR2.

library(stringr)
a <- " anything goes here, STR1 GET_ME STR2, anything goes here"
res <- str_match(a, "STR1\\s*(.*?)\\s*STR2")
res[,2]
[1] "GET_ME"
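The two variations mentioned above, multiple occurrences and matches that span line breaks, might look like this in practice. The multi-line sample string is an assumption added for illustration; the patterns are the ones from the answer.

```r
library(stringr)

# Two occurrences in one string
test <- " anything goes here, STR1 GET_ME STR2, anything goes here STR1 GET_ME2 STR2"

# str_match_all() returns a list with one matrix per input string;
# column 2 of the matrix holds the first capture group
all_matches <- str_match_all(test, "STR1\\s*(.*?)\\s*STR2")
captured <- all_matches[[1]][, 2]
print(captured)

# Hypothetical input where the wanted text contains a newline
multi_line <- "STR1 first line\nsecond line STR2"

# Without (?s) the dot does not match a newline, so there is no match (NA)
no_flag <- str_match(multi_line, "STR1\\s*(.*?)\\s*STR2")[, 2]

# With (?s) the dot also matches newlines
with_flag <- str_match(multi_line, "(?s)STR1\\s*(.*?)\\s*STR2")[, 2]

print(no_flag)
print(with_flag)
```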

Another way using base R regexec (to get the first match):

test <- " anything goes here, STR1 GET_ME STR2, anything goes here STR1 GET_ME2 STR2"
pattern <- "STR1\\s*(.*?)\\s*STR2"
result <- regmatches(test, regexec(pattern, test))
result[[1]][2]
[1] "GET_ME"
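Since regexec only returns the first match, getting every occurrence in base R needs a different call. One possible sketch, which is a swapped-in technique rather than part of the answer above, uses gregexpr with perl = TRUE and lookarounds so that only the text between the markers is returned. It assumes exactly one space separates the markers from the value, because a PCRE lookbehind must have a fixed length.

```r
test <- " anything goes here, STR1 GET_ME STR2, anything goes here STR1 GET_ME2 STR2"

# (?<=STR1 ) requires "STR1 " immediately before the match;
# (?= STR2)  requires " STR2" immediately after it.
# Neither marker becomes part of the returned text.
all_hits <- regmatches(
  test,
  gregexpr("(?<=STR1 ).*?(?= STR2)", test, perl = TRUE)
)[[1]]

print(all_hits)
```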


answered Sep 21 '22 14:09

Wiktor Stribiżew

Here's another way, using base R:

a <- " anything goes here, STR1 GET_ME STR2, anything goes here"
gsub(".*STR1 (.+) STR2.*", "\\1", a)

Output:

[1] "GET_ME"
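One thing to keep in mind with the gsub approach is that it returns the input unchanged when the pattern does not match. The sketch below shows that behavior and a possible alternative that yields NA instead. The helper name extract_between is hypothetical, not part of base R or stringr, and it treats both markers as regular expressions.

```r
a <- " anything goes here, STR1 GET_ME STR2, anything goes here"

# With no match, gsub() hands back the original string untouched
unchanged <- gsub(".*STR1 (.+) STR2.*", "\\1", "no markers here")
print(unchanged)

# Hypothetical helper: first capture between two markers, NA if absent
extract_between <- function(x, left, right) {
  pattern <- paste0(left, "\\s*(.*?)\\s*", right)
  groups <- regmatches(x, regexec(pattern, x))
  # Each element is c(full_match, capture) or character(0) when unmatched
  unname(vapply(
    groups,
    function(g) if (length(g) >= 2) g[2] else NA_character_,
    character(1)
  ))
}

found <- extract_between(c(a, "no markers here"), "STR1", "STR2")
print(found)
```

Returning NA makes a failed extraction easy to detect with is.na(), whereas the gsub result has to be compared against the input to tell the two cases apart.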

answered Sep 21 '22 14:09

Ulises Rosas-Puchuri
