
R: regex to extract a string between a dash and a period

Tags:

regex

r

First of all I apologize if this question is too naive or has been repeated earlier. I tried to find it in the forum but I'm posting it as a question because I failed to find an answer.

I have a data frame with row names as follows:

head(rownames(u))

[1] "A17-R-Null-C-3.AT2G41240"       "A18-R-Null-C-3.AT2G41240"         "B19-R-Null-C-3.AT2G41240"      
[4] "B20-R-Null-C-3.AT2G41240"       "A21-R-Transgenic-C-3.AT2G41240" "A22-R-Transgenic-C-3.AT2G41240"

What I want is to use regex in R to extract the string in between the first dash and the last period.

The expected results are:

[1] "R-Null-C-3"       "R-Null-C-3"         "R-Null-C-3"      
[4] "R-Null-C-3"       "R-Transgenic-C-3" "R-Transgenic-C-3"

I tried the following with no luck:

gsub("^[^-]*-|.+\\.","\\2", rownames(u))
gsub("^.+-","", rownames(u))
sub("^[^-]*.|\\..","", rownames(u))

Would someone be able to help me with this problem?

Thanks a lot in advance.

Shani.

Shani A. asked Dec 25 '22 09:12


1 Answer

Here is a solution to be used with gsub:

v <- c("A17-R-Null-C-3.AT2G41240", "A18-R-Null-C-3.AT2G41240", "B19-R-Null-C-3.AT2G41240", "B20-R-Null-C-3.AT2G41240", "A21-R-Transgenic-C-3.AT2G41240", "A22-R-Transgenic-C-3.AT2G41240")
gsub("^[^-]*-([^.]+).*", "\\1", v)


The regex matches:

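  • ^[^-]* - the start of the string, then zero or more characters other than -
  • - - a hyphen
  • ([^.]+) - Group 1, capturing one or more characters other than a dot, so the capture ends just before the first dot
  • .* - any characters (including newlines, since perl=T is not used), any number of occurrences up to the end of the string. Because the pattern consumes the whole string, replacing the match with \\1 leaves only Group 1.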
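
Note that the question asks for the text up to the last period, while the pattern above stops at the first one. For the sample data these coincide, since each name has a single period. If a name could contain more than one period, a variant along the following lines might work. This is only a sketch: the `extra` value is a made-up example that is not in the question, and `sub` is used instead of `gsub` on the assumption that one replacement per string is enough.

```r
# Two of the row names from the question, plus one hypothetical name
# containing two periods (not present in the original data).
v <- c("A17-R-Null-C-3.AT2G41240", "A21-R-Transgenic-C-3.AT2G41240")
extra <- "A17-R-Null-C-3.5.AT2G41240"
x <- c(v, extra)

# Pattern from the answer: [^.]+ cannot cross a period,
# so the capture ends before the first period.
first_dot <- sub("^[^-]*-([^.]+).*", "\\1", x)

# Variant: the trailing \\.[^.]*$ must match the final period and
# whatever follows it, so the capture extends up to the last period.
last_dot <- sub("^[^-]*-(.*)\\.[^.]*$", "\\1", x)

print(first_dot)
print(last_dot)
```

Both calls give the same result for the names in the question and differ only for the name with two periods. If a string has no dash or no period, the variant does not match and `sub` returns that string unchanged.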

Wiktor Stribiżew answered Jan 09 '23 02:01