Extract all substrings in string

I want to extract all substrings that begin with M and are terminated by a *, including overlapping ones.

Take the string below as an example:

vec<-c("SHVANSGYMGMTPRLGLESLLE*A*MIRVASQ")

Ideally, this would return:

MGMTPRLGLESLLE
MTPRLGLESLLE

I have tried the code below:

regmatches(vec, gregexpr('(?<=M).*?(?=\\*)', vec, perl=T))[[1]]

However, this drops the leading M and returns only the first match rather than every substring contained within it:

"GMTPRLGLESLLE"

Nosey asked Aug 31 '25 20:08


1 Answer

You can use

(?=(M[^*]*)\*)

Details:

  • (?= - start of a positive lookahead that matches a location immediately followed by:
  • (M[^*]*) - Group 1: M, then zero or more chars other than a * char
  • \* - a * char
  • ) - end of the lookahead.
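
To see why wrapping everything in a lookahead matters, here is a small base R sketch (not part of the original answer) that contrasts a consuming pattern with the lookahead-only one. The variable names consuming, positions and widths are purely illustrative.

```r
vec <- "SHVANSGYMGMTPRLGLESLLE*A*MIRVASQ"

# Consuming pattern: the match swallows the text up to the *, so the scan
# resumes after it and the inner "M..." substring is never tried.
consuming <- regmatches(vec, gregexpr("M[^*]*(?=\\*)", vec, perl = TRUE))[[1]]

# Lookahead-only pattern: every match is zero-width, so the regex engine
# moves on one character at a time and tests each M as a possible start.
m <- gregexpr("(?=(M[^*]*)\\*)", vec, perl = TRUE)[[1]]
positions <- as.integer(m)            # where each zero-width match sits
widths <- attr(m, "match.length")     # all zero: nothing is consumed

print(consuming)
print(positions)
print(widths)
```

Because the overall match is empty, the text of interest has to be read from Group 1 rather than from the match itself, which is what the demos do.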

See the R demo:

library(stringr)
vec <- c("SHVANSGYMGMTPRLGLESLLE*A*MIRVASQ")
# Each list element is a matrix: column 1 is the (empty) overall match, column 2 is Group 1
matches <- str_match_all(vec, "(?=(M[^*]*)\\*)")
unlist(lapply(matches, function(z) z[, 2]))
## => [1] "MGMTPRLGLESLLE" "MTPRLGLESLLE"

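If you prefer a base R solution (note that gregexec() requires R 4.1.0 or later):

vec <- c("SHVANSGYMGMTPRLGLESLLE*A*MIRVASQ")
# Each list element is a matrix: row 1 is the (empty) overall match, the rows below are the groups
matches <- regmatches(vec, gregexec("(?=(M[^*]*)\\*)", vec, perl=TRUE))
# tail(x, -1) drops row 1 and keeps only the captured substrings
unlist(lapply(matches, tail, -1))
## => [1] "MGMTPRLGLESLLE" "MTPRLGLESLLE"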
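
If gregexec() is not available in your R version, the same captures can be read from the attributes that gregexpr() attaches when perl=TRUE. The following is only a sketch: extract_overlapping is a hypothetical helper name, not a function from base R or stringr, and it assumes the pattern has a capture group inside the lookahead.

```r
# Hypothetical helper: returns, for each input string, the Group 1 captures
# of a lookahead-only pattern, using gregexpr()'s capture attributes.
extract_overlapping <- function(x, pattern = "(?=(M[^*]*)\\*)") {
  lapply(x, function(s) {
    m <- gregexpr(pattern, s, perl = TRUE)[[1]]
    # gregexpr() signals "no match" with a single -1
    if (m[1] == -1) return(character(0))
    start <- attr(m, "capture.start")[, 1]   # start of Group 1 per match
    len   <- attr(m, "capture.length")[, 1]  # length of Group 1 per match
    substring(s, start, start + len - 1)
  })
}

res <- extract_overlapping(c("SHVANSGYMGMTPRLGLESLLE*A*MIRVASQ", "NOSTAR"))
print(res)
```

The result is kept as a list with one element per input string, so strings without any match yield an empty character vector instead of silently disappearing; wrap the call in unlist() if a flat vector is preferred.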

Wiktor Stribiżew answered Sep 05 '25 05:09