 

How to edit colnames in R?

Tags:

regex

r

I have a variable called name that I want to use as the column names of my matrix, but first I need to edit the strings it contains:

>name
[722] "TCGA-OL-A66N-01A-12R-A31S-13.isoform.quantification.txt"
[723] "TCGA-OL-A66O-01A-11R-A31S-13.isoform.quantification.txt"
[724] "TCGA-OL-A66P-01A-11R-A31S-13.isoform.quantification.txt"

I want to keep only the part of each string before the fourth -

Expected Output:

  >name
    [722] "TCGA-OL-A66N-01A"
    [723] "TCGA-OL-A66O-01A"
    [724] "TCGA-OL-A66P-01A"

Could someone help me implement this in R?

asked May 28 '14 17:05 by user2806363



2 Answers

In a regex, "[" opens a character class, and a "^" in the first position inside the class negates it, so "[^-]*" matches a run of characters that are not hyphens. For details, see:

?regex
?sub

sub("^([^-]*[-][^-]*[-][^-]*[-][^-]*)([-].*$)", "\\1", name)
[1] "TCGA-OL-A66N-01A" "TCGA-OL-A66O-01A" "TCGA-OL-A66P-01A"
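The same idea can be written more compactly with a repetition count. This is only a sketch of an equivalent pattern, not part of the original answer; the variable short is a name introduced here for illustration.

```r
name <- c(
  "TCGA-OL-A66N-01A-12R-A31S-13.isoform.quantification.txt",
  "TCGA-OL-A66O-01A-11R-A31S-13.isoform.quantification.txt",
  "TCGA-OL-A66P-01A-11R-A31S-13.isoform.quantification.txt")

# "([^-]*-){3}" matches three "field followed by a hyphen" groups and the
# trailing "[^-]*" matches the fourth field. The outer parentheses form
# group 1, which is all that the replacement "\\1" keeps.
short <- sub("^(([^-]*-){3}[^-]*).*$", "\\1", name)
short
```

Strings with fewer than three hyphens do not match the pattern, so sub() returns them unchanged.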

A base R alternative that is simpler (IMO) than the str_split approach:

# strsplit() returns a list; "[" with 1:4 extracts the first 4 pieces of each element
sapply(lapply(strsplit(name, "\\-"), "[", 1:4),
       paste, collapse = "-")  # 'collapse' needed rather than 'sep'

#[1] "TCGA-OL-A66N-01A" "TCGA-OL-A66O-01A" "TCGA-OL-A66P-01A"
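To connect this back to the question, here is a minimal sketch that applies the split-and-paste idea and then assigns the result as column names. The matrix m is a made-up stand-in for the asker's data, and using head() instead of "[" with 1:4 is a choice made here: it avoids NA padding when a name has fewer than four pieces.

```r
name <- c(
  "TCGA-OL-A66N-01A-12R-A31S-13.isoform.quantification.txt",
  "TCGA-OL-A66O-01A-11R-A31S-13.isoform.quantification.txt",
  "TCGA-OL-A66P-01A-11R-A31S-13.isoform.quantification.txt")

# Hypothetical matrix with one column per name (illustration only)
m <- matrix(seq_len(6), nrow = 2, ncol = length(name))

# Split on a literal "-", keep at most the first 4 pieces, and re-join them.
# vapply() guarantees a character vector with one string per input.
short <- vapply(strsplit(name, "-", fixed = TRUE),
                function(parts) paste(head(parts, 4), collapse = "-"),
                character(1))

colnames(m) <- short
colnames(m)
```

With "[" and 1:4, a short name such as "a-b" would come back as "a-b-NA-NA"; head() returns just the pieces that exist.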
answered Sep 27 '22 22:09 by IRTFM



If the length of the part you want varies, so that you cannot rely on it being a fixed number of characters (nchar), you can use str_split_fixed() from stringr.

stringr solution:

library(stringr)

name <- c(
    "TCGA-OL-A66N-01A-12R-A31S-13.isoform.quantification.txt",
    "TCGA-OL-A66O-01A-11R-A31S-13.isoform.quantification.txt",
    "TCGA-OL-A66P-01A-11R-A31S-13.isoform.quantification.txt")

apply(str_split_fixed(name,"-",5)[,1:4],1,paste0,collapse="-")

will give you what you want:

## "TCGA-OL-A66N-01A" "TCGA-OL-A66O-01A" "TCGA-OL-A66P-01A"

explanation:

  • str_split_fixed(name,"-",5)

split each element of name into 5 pieces at the first 4 occurrences of - (the fifth piece holds the rest of the string)

  • [,1:4]

retain the first 4 pieces (columns of resulting matrix) for each name element

  • apply(...,1,paste0,collapse="-")

paste the pieces of each row back together, using "-" as the separator, to restore the names
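The three steps above can be sketched one at a time to make the intermediate matrix visible. This assumes stringr is installed; parts and short are names introduced here, and drop = FALSE is an addition so the subset stays a matrix even when name has a single element (otherwise apply() would fail on a plain vector).

```r
library(stringr)

name <- c(
  "TCGA-OL-A66N-01A-12R-A31S-13.isoform.quantification.txt",
  "TCGA-OL-A66O-01A-11R-A31S-13.isoform.quantification.txt",
  "TCGA-OL-A66P-01A-11R-A31S-13.isoform.quantification.txt")

# Step 1: a character matrix with one row per name and 5 columns;
# column 5 holds everything after the fourth "-"
parts <- str_split_fixed(name, "-", 5)
dim(parts)
parts[1, 5]

# Step 2: keep the first 4 columns, staying a matrix even for one row
kept <- parts[, 1:4, drop = FALSE]

# Step 3: paste each row back together with "-"
short <- apply(kept, 1, paste0, collapse = "-")
short
```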


But what if I have many, many names?

Here I'm comparing my stringr + apply() method to @BondedDust's regex (sub) method and the base strsplit method.

First, let's bump it up to about 10 thousand names:

name <- rep(name,3.334e3)

then a microbenchmark:

library(microbenchmark)  # provides microbenchmark(); stringr is loaded above

microbenchmark(
  stringr_apply = apply(str_split_fixed(name,"-",5)[,1:4],1,paste0,collapse="-"),
  grep_ninja = sub("^([^-]*[-][^-]*[-][^-]*[-][^-]*)([-].*$)", "\\1", name),
  strsplit = sapply( lapply( strsplit(name, "\\-"), "[", 1:4), paste, collapse="-"), 
  times=25)

and get:

# Unit: milliseconds
#           expr       min       lq    median        uq       max neval
#  stringr_apply 845.44542 874.5674 899.27849 941.22628 976.88903    25
#     grep_ninja  25.51796  25.7066  25.85404  25.95922  27.89165    25
#       strsplit 115.10626 123.2645 126.45171 130.10334 147.39517    25

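It seems that base pattern matching / replacement scales better: by median time, sub() takes about 26 ms here versus about 900 ms for the slowest method, which is roughly 35x faster, and it is about 5x faster than strsplit.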
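Before comparing timings it is worth confirming that the approaches return the same result on the enlarged input. The sketch below checks only the two base R methods, so it runs without extra packages; by_sub and by_split are names introduced here.

```r
name <- rep(c(
  "TCGA-OL-A66N-01A-12R-A31S-13.isoform.quantification.txt",
  "TCGA-OL-A66O-01A-11R-A31S-13.isoform.quantification.txt",
  "TCGA-OL-A66P-01A-11R-A31S-13.isoform.quantification.txt"), 3334)

length(name)  # 3 names repeated 3334 times

# Regex replacement, as in the grep_ninja entry of the benchmark
by_sub <- sub("^([^-]*[-][^-]*[-][^-]*[-][^-]*)([-].*$)", "\\1", name)

# Split, keep the first 4 pieces, and re-join, as in the strsplit entry
by_split <- sapply(lapply(strsplit(name, "\\-"), "[", 1:4), paste, collapse = "-")

# TRUE means both methods agree element by element
identical(by_sub, by_split)
```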

answered Sep 27 '22 22:09 by npjc
