I have a variable called name that I want to use as the column names of my matrix, but before doing this I need to edit the strings stored in name.
>name
[722] "TCGA-OL-A66N-01A-12R-A31S-13.isoform.quantification.txt"
[723] "TCGA-OL-A66O-01A-11R-A31S-13.isoform.quantification.txt"
[724] "TCGA-OL-A66P-01A-11R-A31S-13.isoform.quantification.txt"
I want to keep only the characters before the fourth -
Expected Output:
>name
[722] "TCGA-OL-A66N-01A"
[723] "TCGA-OL-A66O-01A"
[724] "TCGA-OL-A66P-01A"
Could someone help me implement this in R?
In a regex, "[" opens a character class, and a "^" in the first position inside the class negates it, so [^-]* matches a run of characters that are not hyphens. See:
?regex
?sub
sub("^([^-]*[-][^-]*[-][^-]*[-][^-]*)([-].*$)", "\\1", name)
[1] "TCGA-OL-A66N-01A" "TCGA-OL-A66O-01A" "TCGA-OL-A66P-01A"
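The same idea can be written more compactly with a repetition count instead of spelling out each field. This is only a sketch of an equivalent pattern, not part of the answer above; the variable short is a name introduced here for illustration.

```r
# Sample input copied from the question
name <- c(
  "TCGA-OL-A66N-01A-12R-A31S-13.isoform.quantification.txt",
  "TCGA-OL-A66O-01A-11R-A31S-13.isoform.quantification.txt",
  "TCGA-OL-A66P-01A-11R-A31S-13.isoform.quantification.txt")

# ([^-]*-){3} : three runs of non-hyphen characters, each followed by "-"
# [^-]*       : the fourth run, stopping before the fourth "-"
# .*          : everything after that, which is dropped
# \\1 is the outer group, i.e. the first four fields
short <- sub("^(([^-]*-){3}[^-]*).*", "\\1", name)
print(short)
# [1] "TCGA-OL-A66N-01A" "TCGA-OL-A66O-01A" "TCGA-OL-A66P-01A"
```

Strings with fewer than three hyphens do not match the pattern, so sub() returns them unchanged.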
This would be simpler (IMO) than the str_split approach
sapply( lapply( strsplit(name, "\\-"), "[", 1:4),
# extracted the first 4 elements from each list element returned by strsplit
paste, collapse="-") # 'collapse' needed rather than 'sep'
#[1] "TCGA-OL-A66N-01A" "TCGA-OL-A66O-01A" "TCGA-OL-A66P-01A"
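One caveat with indexing by 1:4: an element that splits into fewer than four pieces is padded with NA, and paste() then writes the text "NA" into the result. A minimal sketch of a variant that avoids this, assuming short inputs should simply be kept as they are (the second sample value and the name first4 are made up for illustration):

```r
# First value is from the question; the second is a hypothetical short name
name <- c("TCGA-OL-A66N-01A-12R-A31S-13.isoform.quantification.txt",
          "TCGA-OL")

# head(parts, 4) returns at most four pieces and never pads with NA;
# vapply() guarantees one character string per input element
first4 <- vapply(strsplit(name, "-", fixed = TRUE),
                 function(parts) paste(head(parts, 4), collapse = "-"),
                 character(1))
print(first4)
# [1] "TCGA-OL-A66N-01A" "TCGA-OL"
```

With "[", 1:4 the second element would instead come out as "TCGA-OL-NA-NA".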
If the length varies, so the fourth - is not a guaranteed number of characters (nchar) away, you can use str_split_fixed() from stringr.
stringr solution:
library(stringr)
name <- c(
"TCGA-OL-A66N-01A-12R-A31S-13.isoform.quantification.txt",
"TCGA-OL-A66O-01A-11R-A31S-13.isoform.quantification.txt",
"TCGA-OL-A66P-01A-11R-A31S-13.isoform.quantification.txt")
apply(str_split_fixed(name,"-",5)[,1:4],1,paste0,collapse="-")
will give you what you want:
## "TCGA-OL-A66N-01A" "TCGA-OL-A66O-01A" "TCGA-OL-A66P-01A"
str_split_fixed(name,"-",5): split each element of name into 5 pieces at the first 4 occurrences of -
[,1:4]: retain the first 4 pieces (columns of the resulting matrix) for each name element
apply(...,1,paste0,collapse="-"): paste them back together row-wise, collapsing with "-", to restore the names
Here I'm comparing my stringr + apply() method to @BondedDust's regex (sub) method and the basic strsplit method.
First, let's bump it up to about 10 thousand names:
name <- rep(name,3.334e3)
then run a microbenchmark:
library(microbenchmark)
microbenchmark(
stringr_apply = apply(str_split_fixed(name,"-",5)[,1:4],1,paste0,collapse="-"),
grep_ninja = sub("^([^-]*[-][^-]*[-][^-]*[-][^-]*)([-].*$)", "\\1", name),
strsplit = sapply( lapply( strsplit(name, "\\-"), "[", 1:4), paste, collapse="-"),
times=25)
and get:
# Unit: milliseconds
# expr min lq median uq max neval
# stringr_apply 845.44542 874.5674 899.27849 941.22628 976.88903 25
# grep_ninja 25.51796 25.7066 25.85404 25.95922 27.89165 25
# strsplit 115.10626 123.2645 126.45171 130.10334 147.39517 25
It seems base pattern matching / replacement scales better: a median of roughly 26 ms versus 900 ms here, which saves almost a second and is about 35x faster than the slowest method.
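To tie this back to the original goal, the trimmed names can be assigned as column names directly. This is a sketch only: mat is a hypothetical 2 x 3 matrix standing in for the real data, and the regex is the one from the sub() answer above.

```r
# Sample names copied from the question
name <- c(
  "TCGA-OL-A66N-01A-12R-A31S-13.isoform.quantification.txt",
  "TCGA-OL-A66O-01A-11R-A31S-13.isoform.quantification.txt",
  "TCGA-OL-A66P-01A-11R-A31S-13.isoform.quantification.txt")

# Hypothetical matrix with one column per file name
mat <- matrix(1:6, nrow = 2)

# Keep everything before the fourth "-" and use it as the column names
colnames(mat) <- sub("^([^-]*[-][^-]*[-][^-]*[-][^-]*)([-].*$)", "\\1", name)

# Trimming can make distinct names collide; anyDuplicated() returns 0
# when every column name is still unique
print(colnames(mat))
print(anyDuplicated(colnames(mat)))
```

If anyDuplicated() returns a non-zero index, two files were shortened to the same name and need to be disambiguated before use.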