Efficiently create dataframe from strings containing key-value pairs

I would like to ask you for efficiency suggestions for a specific coding problem in R. I have a string vector in the following style:

[1] "HGVSc=ENST00000495576.1:n.820-1G>A;INTRON=1/1;CANONICAL=YES"
[2] "DISTANCE=2179"                                              
[3] "HGVSc=ENST00000466430.1:n.911C>T;EXON=4/4;CANONICAL=YES"    
[4] "DISTANCE=27;CANONICAL=YES;common"

In each element of the vector, the entries are separated by a ; and most entries have the format KEY=VALUE. However, some entries consist only of a KEY (see "common" in [4]). In this example there are 15 different keys, and not every key appears in each element of the vector. The 15 different keys are:

names <- c('ENSP','HGVS','DOMAINS','EXON','INTRON', 'HGVSp', 'HGVSc','CANONICAL','GMAF','DISTANCE', 'HGNC', 'CCDS', 'SIFT', 'PolyPhen', 'common')
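To make the entry format concrete, here is a minimal sketch that takes one element apart into keys and values. It only illustrates the format described above and is not meant as an efficient solution; the variable names (entry, parts, keys, values) are chosen for illustration.

```r
# One element from the example vector above
entry <- "DISTANCE=27;CANONICAL=YES;common"

# Split the element into its entries on ";"
parts <- strsplit(entry, ";", fixed = TRUE)[[1]]

# The key is everything before the first "="; flag-only entries have no "="
keys <- sub("=.*$", "", parts)
has_value <- grepl("=", parts, fixed = TRUE)

# The value is everything after the first "=", or "YES" for flag-only entries
values <- ifelse(has_value, sub("^[^=]*=", "", parts), "YES")

# Named character vector: one value per key found in this element
setNames(values, keys)
```

Here DISTANCE maps to "27", while CANONICAL and the flag-only key common both map to "YES".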

From this vector I would like to create a dataframe that looks like this:

  ENSP HGVS DOMAINS EXON INTRON HGVSp                        HGVSc CANONICAL
1    -    -       -    -    1/1     - ENST00000495576.1:n.820-1G>A       YES
2    -    -       -    -      -     -                            -         -
3    -    -       -  4/4      -     -   ENST00000466430.1:n.911C>T       YES
4    -    -       -    -      -     -                            -       YES
  GMAF DISTANCE HGNC CCDS SIFT PolyPhen common
1    -        -    -    -    -        -      -
2    -     2179    -    -    -        -      -
3    -        -    -    -    -        -      -
4    -       27    -    -    -        -    YES

I wrote this function to solve the problem:

unlist.info <- function(names, column){
  info.mat <- matrix(rep('-', length(column)*length(names)), nrow=length(column), ncol=length(names), dimnames=list(c(), names))
  info.mat <- as.data.frame(info.mat, stringsAsFactors=FALSE)

  for (i in seq_along(column)){
    info <- unlist(strsplit(column[i], "\\;"))
    for (e in info){
      e <- unlist(strsplit(e, "\\="))
      j <- which(names == e[1])
      if (length(e) > 1){
        # KEY=VALUE. The value might contain a = as well
        value <- paste(e[2:length(e)], collapse='=')
        info.mat[i,j] <- value
      }else{
        # only KEY
        info.mat[i,j] <- 'YES'
      }
    }
  }
  return(info.mat)
}

And then I call:

mat <- unlist.info(names, vector)

Even though this works, it is really slow, and I am handling vectors with over 100,000 entries. I realize that looping is inelegant and inefficient in R, and I am familiar with the concept of applying functions to data frames. However, since every element of the vector contains a different subset of KEY=VALUE or KEY entries, I could not come up with a more efficient function.

Eva König asked Oct 16 '12 11:10


2 Answers

Here you go:

Recreate the data:

x <- c(
  "HGVSc=ENST00000495576.1:n.820-1G>A;INTRON=1//1;CANONICAL=YES",
  "DISTANCE=2179",
  "HGVSc=ENST00000466430.1:n.911C>T;EXON=4//4;CANONICAL=YES",
  "DISTANCE=27;CANONICAL=YES;common"
)

Create a named vector with your desired names. This is used for fast lookup later:

names <- setNames(1:15, c('ENSP','HGVS','DOMAINS','EXON','INTRON', 'HGVSp', 'HGVSc','CANONICAL','GMAF','DISTANCE', 'HGNC', 'CCDS', 'SIFT', 'PolyPhen', 'common'))

Create a helper function that assigns each variable to the correct position in a matrix. Then use lapply and strsplit:

assign <- function(x, names){
  xx <- sapply(x, function(i)if(length(i)==2L) i else c(i, "YES"))
  z <- rep(NA, length(names))
  z[names[xx[1, ]]] <- xx[2, ]
  z
}

sx <- lapply(strsplit(x, ";"), strsplit, "=")
ret <- t(sapply(sx, assign, names))
colnames(ret) <- names(names)
ret

The results:

     ENSP HGVS DOMAINS EXON   INTRON HGVSp HGVSc                          CANONICAL GMAF DISTANCE HGNC
[1,] NA   NA   NA      NA     "1//1" NA    "ENST00000495576.1:n.820-1G>A" "YES"     NA   NA       NA  
[2,] NA   NA   NA      NA     NA     NA    NA                             NA        NA   "2179"   NA  
[3,] NA   NA   NA      "4//4" NA     NA    "ENST00000466430.1:n.911C>T"   "YES"     NA   NA       NA  
[4,] NA   NA   NA      NA     NA     NA    NA                             "YES"     NA   "27"     NA  
     CCDS SIFT PolyPhen common
[1,] NA   NA   NA       NA    
[2,] NA   NA   NA       NA    
[3,] NA   NA   NA       NA    
[4,] NA   NA   NA       "YES" 
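One caveat: the question notes that a value may itself contain a =, and as far as I can tell the helper above assumes each entry splits into at most two pieces. A possible variant, sketched here and not part of the original answer, splits only on the first = and fills the result by (row, column) index pairs instead of using sapply. The names parse_info, keys, fill, and the demo data are hypothetical; the third demo entry is made up to show a value containing =.

```r
# Hypothetical helper: parse "KEY=VALUE;KEY;..." strings into a data frame.
# Missing keys are filled with `fill`; flag-only keys get "YES".
parse_info <- function(x, keys, fill = "-") {
  parts <- strsplit(x, ";", fixed = TRUE)
  # Row index of each entry in the flattened vector
  row <- rep(seq_along(parts), lengths(parts))
  flat <- unlist(parts, use.names = FALSE)

  # Split on the first "=" only, so values may contain "="
  key <- sub("=.*$", "", flat)
  value <- ifelse(grepl("=", flat, fixed = TRUE),
                  sub("^[^=]*=", "", flat),
                  "YES")

  # Column index of each key; keys not listed in `keys` are dropped
  col <- match(key, keys)
  keep <- !is.na(col)

  out <- matrix(fill, nrow = length(x), ncol = length(keys),
                dimnames = list(NULL, keys))
  # A two-column index matrix assigns all cells in one step
  out[cbind(row[keep], col[keep])] <- value[keep]
  as.data.frame(out, stringsAsFactors = FALSE)
}

# Demo data (the last entry is invented for illustration)
x_demo <- c(
  "HGVSc=ENST00000495576.1:n.820-1G>A;INTRON=1/1;CANONICAL=YES",
  "DISTANCE=27;CANONICAL=YES;common",
  "HGVSp=p.Val5="
)
keys_demo <- c("INTRON", "HGVSc", "HGVSp", "CANONICAL", "DISTANCE", "common")
res_demo <- parse_info(x_demo, keys_demo)
res_demo
```

Because the string operations run once over the flattened vector rather than once per element, this style may scale better to long vectors, but I have not benchmarked it against the solutions here.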
Andrie answered Sep 19 '22 16:09


Here's another, faster, solution taking advantage of the original pairings...

##                   test elapsed replications relative average
## 2    thell_solution(x)    0.37         1000    1.000 0.00037
## 3   andrie_solution(x)    1.04         1000    2.811 0.00104
## 1 original_solution(x)    2.61         1000    7.054 0.00261

After splitting, each key is followed directly by its value, except for the final flag (not that I understand why that one flag is treated differently in the original string vector). We can take advantage of this sequence and of the fact that indexing past the end of a vector returns NA (i.e. x[5] is NA when x has only four elements), so a key given without a value is assigned NA, which is then replaced by "YES". We also have no need to call names multiple times. And since strsplit takes a regular expression, we can split on both ; and = with alternation.
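As a minimal sketch of those two ideas on a single element (the variable names entry, tokens, key_pos, keys, and values are illustrative and not part of the solution below):

```r
entry <- "DISTANCE=27;CANONICAL=YES;common"

# Splitting on ";" or "=" yields alternating key, value, key, value, ...
tokens <- strsplit(entry, "(;|=)")[[1]]

# Keys sit at the odd positions, values right after them
key_pos <- seq.int(1, length(tokens), by = 2)
keys <- tokens[key_pos]

# For the trailing flag, key_pos + 1 points past the end, which gives NA
values <- tokens[key_pos + 1]
values[is.na(values)] <- "YES"
```

Note that this pairing relies on a flag-only key appearing last. An element such as "common;DISTANCE=27" would split into "common", "DISTANCE", "27", and the odd positions would then pick "common" and "27" as keys.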

# Let `x` be as @Andrie made it in his answer.  Let `names` be as you had
# in the original question.

# A pre-built dummy record and empty list.
na.record <- setNames(rep(NA, times = length(names)), names)
y <- list()

do.call(rbind, lapply(strsplit(x, "(;|=)"), FUN = function(x) {
    x_seq <- seq.int(to = length(x), by = 2)
    y[x[x_seq]] <- x[x_seq + 1]
    y[is.na(y)] <- "YES"
    na.record[x[x_seq]] <- y
    na.record
}))


##      ENSP HGVS DOMAINS EXON   INTRON HGVSp HGVSc                         
## [1,] NA   NA   NA      NA     "1//1" NA    "ENST00000495576.1:n.820-1G>A"
## [2,] NA   NA   NA      NA     NA     NA    NA                            
## [3,] NA   NA   NA      "4//4" NA     NA    "ENST00000466430.1:n.911C>T"  
## [4,] NA   NA   NA      NA     NA     NA    NA                            
##      CANONICAL GMAF DISTANCE HGNC CCDS SIFT PolyPhen common
## [1,] "YES"     NA   NA       NA   NA   NA   NA       NA    
## [2,] NA        NA   "2179"   NA   NA   NA   NA       NA    
## [3,] "YES"     NA   NA       NA   NA   NA   NA       NA    
## [4,] "YES"     NA   "27"     NA   NA   NA   NA       "YES"
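The question asked for a data frame with "-" for missing keys, while the result above is a matrix with NA. Assigning a list into na.record suggests the result is a list-matrix, though that is my reading of the code and not something the answer states. A minimal sketch of the conversion, using a small stand-in matrix (res, chr, and df are illustrative names) and assuming every cell holds exactly one value:

```r
# Stand-in for the result above: a list-matrix with NA for missing keys
res <- matrix(
  list(NA, "4//4", "2179", NA, NA, "YES"),
  nrow = 2,
  dimnames = list(NULL, c("EXON", "DISTANCE", "common"))
)

# Flatten to a character matrix; unlist() keeps the column-major order
chr <- matrix(unlist(res), nrow = nrow(res), dimnames = dimnames(res))

# Replace missing values with the placeholder used in the question
chr[is.na(chr)] <- "-"

df <- as.data.frame(chr, stringsAsFactors = FALSE)
df
```

The same steps should also work on a plain character matrix, where unlist() leaves the values unchanged.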
Thell answered Sep 23 '22 16:09