String split with conditions in R

I have a character vector, mystring, whose elements use _ as a delimiter. If an element contains two or more delimiters, I want to split it at the second delimiter; if it contains only one, I want to split it at ".ReCal". The result I want is shown below.

mystring<-c("MODY_60.2.ReCal.sort.bam","MODY_116.21_C4U.ReCal.sort.bam","MODY_116.3_C2RX-1-10.ReCal.sort.bam","MODY_116.4.ReCal.sort.bam")

result

"MODY_60.2"  "MODY_116.21" "MODY_116.3"  "MODY_116.4"

MAPK asked Aug 11 '15 01:08


3 Answers

You can do this with the gsubfn package:

library(gsubfn)
# x is the full match; y and z are the first and second capture groups
f <- function(x, y, z) if (z == "_") y else strsplit(x, ".ReCal", fixed = TRUE)[[1]][[1]]
gsubfn("([^_]+_[^_]+)(.).*", f, mystring, backref=2)
# [1] "MODY_60.2"   "MODY_116.21" "MODY_116.3"  "MODY_116.4" 

This also handles strings that contain more than two "_", where you still want to split on the second one. For example:

mystring<-c("MODY_60.2.ReCal.sort.bam",
            "MODY_116.21_C4U.ReCal.sort.bam",
            "MODY_116.3_C2RX-1-10.ReCal.sort.bam",
            "MODY_116.4.ReCal.sort.bam",
            "MODY_116.4_asdfsadf_1212_asfsdf",
            "MODY_116.5.ReCal_asdfsadf_1212_asfsdf",  # split by second "_", leaving ".ReCal"
            "MODY")

gsubfn("([^_]+_[^_]+)(.).*", f, mystring, backref=2)
# [1] "MODY_60.2"        "MODY_116.21"      "MODY_116.3"       "MODY_116.4"      
# [5] "MODY_116.4"       "MODY_116.5.ReCal" "MODY"            

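In the function f, x is the full match (here, the whole string), and y and z are the first and second capture groups: y is the text up to the end of the second field, and z is the single character that follows it. If z is "_", the string has a second delimiter and y is returned. Otherwise, the match is split on the alternative string, ".ReCal", and the first piece is returned.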
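
To make the roles of x, y and z concrete, here is a minimal base R sketch that does not use gsubfn. It relies on regexec(), which returns the full match followed by each capture group. The helper name pick_prefix is hypothetical, and perl = TRUE is an assumption made here so that the backtracking behaviour is well defined; gsubfn's own regex engine may differ.

```r
s <- c("MODY_60.2.ReCal.sort.bam", "MODY_116.21_C4U.ReCal.sort.bam")

# Each element of `parts` is c(full match, group 1, group 2),
# i.e. the same three values described as x, y and z above.
parts <- regmatches(s, regexec("([^_]+_[^_]+)(.).*", s, perl = TRUE))

parts[[2]]
# [1] "MODY_116.21_C4U.ReCal.sort.bam" "MODY_116.21" "_"
parts[[1]]
# [1] "MODY_60.2.ReCal.sort.bam" "MODY_60.2.ReCal.sort.ba" "m"

# pick_prefix() is a hypothetical stand-in for f that takes c(x, y, z).
pick_prefix <- function(p) {
  if (p[3] == "_") p[2] else strsplit(p[1], ".ReCal", fixed = TRUE)[[1]][1]
}

res <- vapply(parts, pick_prefix, character(1))
res
# [1] "MODY_60.2"   "MODY_116.21"
```

With only one "_", the greedy second field swallows almost the whole string, so z ends up as the last character ("m") and the ".ReCal" branch is taken. This sketch only covers strings that match the pattern; an input such as "MODY" would need separate handling.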

Rorschach answered Nov 20 '22 05:11


With the stringr package:

library(stringr)
str_extract(mystring, '.*?_.*?(?=_)|^.*?_.*(?=\\.ReCal)')
# [1] "MODY_60.2"   "MODY_116.21" "MODY_116.3"  "MODY_116.4"

It also works with more than two delimiters.
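
The claim about extra delimiters can be checked with a small base R sketch. The helper extract_first is hypothetical: it imitates str_extract() by returning the first match per element and NA where nothing matches, using regexpr() with perl = TRUE because the pattern contains lookaheads. The three test strings are illustrative inputs, not part of the question.

```r
# extract_first() is a hypothetical base R stand-in for stringr::str_extract().
extract_first <- function(s, pattern) {
  m <- regexpr(pattern, s, perl = TRUE)   # -1 where there is no match
  out <- rep(NA_character_, length(s))
  out[m != -1] <- regmatches(s, m)        # fill in only the matched elements
  out
}

pattern <- ".*?_.*?(?=_)|^.*?_.*(?=\\.ReCal)"
extra <- c("MODY_116.4_asdfsadf_1212_asfsdf",
           "MODY_116.5.ReCal_asdfsadf_1212_asfsdf",
           "MODY")

res <- extract_first(extra, pattern)
res
# [1] "MODY_116.4"       "MODY_116.5.ReCal" NA
```

Because this is an extraction rather than a substitution, a string with no delimiter at all comes back as NA instead of being returned unchanged.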

Pierre L answered Nov 20 '22 04:11


Perl/PCRE has a branch reset feature, (?|...), that lets capturing groups in different alternatives share the same group number, so they are treated as a single capturing group.

IMO, this feature is elegant when you want to supply different alternatives.

x <- c('MODY_60.2.ReCal.sort.bam', 'MODY_116.21_C4U.ReCal.sort.bam', 
       'MODY_116.3_C2RX-1-10.ReCal.sort.bam', 'MODY_116.4.ReCal.sort.bam',
       'MODY_116.4_asdfsadf_1212_asfsdf', 'MODY_116.5.ReCal_asdfsadf_1212_asfsdf', 'MODY')

sub('^(?|([^_]*_[^_]*)_.*|(.*)\\.ReCal.*)$', '\\1', x, perl = TRUE)
# [1] "MODY_60.2"        "MODY_116.21"      "MODY_116.3"       "MODY_116.4"      
# [5] "MODY_116.4"       "MODY_116.5.ReCal" "MODY"  
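
To see what branch reset changes, the sketch below runs the same two alternatives once inside an ordinary non-capturing group and once inside a branch reset group. The two-element input and the variable names plain and reset are chosen here for illustration.

```r
x2 <- c("MODY_116.21_C4U.ReCal.sort.bam", "MODY_60.2.ReCal.sort.bam")

# Ordinary alternation: the captures are numbered 1 and 2, so "\\1" is empty
# whenever the second alternative is the one that matched.
plain <- sub("^(?:([^_]*_[^_]*)_.*|(.*)\\.ReCal.*)$", "\\1", x2, perl = TRUE)
plain
# [1] "MODY_116.21" ""

# Branch reset: both captures are group 1, so "\\1" works for either branch.
reset <- sub("^(?|([^_]*_[^_]*)_.*|(.*)\\.ReCal.*)$", "\\1", x2, perl = TRUE)
reset
# [1] "MODY_116.21" "MODY_60.2"
```

Without branch reset, the replacement would have to be written as "\\1\\2" to cover both alternatives.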

hwnd answered Nov 20 '22 04:11