
Extract the portion of a string that starts with a 4-digit number and ends with a period

Tags: regex, r

I have a character vector like the following:

char <- c("cancer_6_53_7575_tumor.csv", "control_7_4_7363_healthy.csv")

I want to extract the portion of each string that starts with the "7" of the 4-digit patient ID and ends at the ".", but the following method fails when another 7 appears before the patient ID.

values <- unlist(qdapRegex::rm_between(char, "7", ".", extract = TRUE))

How do I specify that the match must start at the 7 of the 4-digit number?

Jack Arnestad asked Feb 03 '18 20:02


2 Answers

You can use this:

char <- c("cancer_6_53_7575_tumor.csv", "control_7_4_7363_healthy.csv")
gsub(".*(7\\d{3}.*)\\..*$", "\\1", char)
## [1] "7575_tumor"   "7363_healthy"
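  1. It looks for a 7 followed by three digits, which together form the 4-digit ID: 7\\d{3}
  2. It captures everything from that ID up to, but not including, the last period in the string (the .* inside the group is greedy): (7\\d{3}.*)\\.
  3. It replaces the whole string with the captured group: \\1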
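
The three steps can also be checked one at a time. This is only a minimal sketch on the same `char` vector; the names `pattern` and `ids` and the two-period filename in the last call are made up for illustration. That last call shows the greedy `.*` inside the group running up to the last period rather than the first.

```r
char <- c("cancer_6_53_7575_tumor.csv", "control_7_4_7363_healthy.csv")

# Step 1: does each string contain a 7 followed by three digits?
grepl("7\\d{3}", char)

# Steps 2 and 3: capture from the 4-digit ID up to the period,
# then replace the whole string with the captured group.
pattern <- ".*(7\\d{3}.*)\\..*$"
ids <- sub(pattern, "\\1", char)
ids

# Hypothetical filename with two periods: the capture runs to the last one.
sub(pattern, "\\1", "case_1_7001_tumor.v2.csv")
```

Since each string is matched as a whole, `sub` and `gsub` give the same result here.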
pogibas answered Oct 19 '22 18:10


Another way is to use stringr.

library(stringr)
str_extract(char, '7\\d{3}[^\\.]*')
## [1] "7575_tumor"   "7363_healthy"

It matches four digits starting with 7, followed by everything up to (but not including) the first dot (.).
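
For comparison, roughly the same extraction can be done without loading stringr. This is a minimal base R sketch, assuming every element contains a match; the name `m` is just for illustration. Unlike `str_extract`, which returns `NA` for a non-matching element, `regmatches` drops such elements from the result.

```r
char <- c("cancer_6_53_7575_tumor.csv", "control_7_4_7363_healthy.csv")

# regexpr() locates the first match in each string;
# regmatches() pulls out the matched text.
m <- regexpr("7\\d{3}[^.]*", char)
regmatches(char, m)
```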

m0nhawk answered Oct 19 '22 20:10