I would like to ask you for efficiency suggestions for a specific coding problem in R. I have a string vector in the following style:
[1] "HGVSc=ENST00000495576.1:n.820-1G>A;INTRON=1/1;CANONICAL=YES"
[2] "DISTANCE=2179"
[3] "HGVSc=ENST00000466430.1:n.911C>T;EXON=4/4;CANONICAL=YES"
[4] "DISTANCE=27;CANONICAL=YES;common"
In each element of the vector, the single entries are separated with a ;
and MOST of the single entries have the format KEY=VALUE
. However, there are also some entries, which only have the format KEY
(see "common" in [4]). In this example, there are 15 different keys and not every key appears in each element of the vector. The 15 different keys are:
names <- c('ENSP','HGVS','DOMAINS','EXON','INTRON', 'HGVSp', 'HGVSc','CANONICAL','GMAF','DISTANCE', 'HGNC', 'CCDS', 'SIFT', 'PolyPhen', 'common')
From this vector I would like to create a dataframe that looks like this:
ENSP HGVS DOMAINS EXON INTRON HGVSp HGVSc CANONICAL
1 - - - - 1/1 - ENST00000495576.1:n.820-1G>A YES
2 - - - - - - - -
3 - - - 4/4 - - ENST00000466430.1:n.911C>T YES
4 - - - - - - - YES
GMAF DISTANCE HGNC CCDS SIFT PolyPhen common
1 - - - - - - -
2 - 2179 - - - - -
3 - - - - - - -
4 - 27 - - - - YES
I wrote this function to solve the problem:
unlist.info <- function(names, column){
info.mat <- matrix(rep('-', length(column)*length(names)), nrow=length(column), ncol=length(names), dimnames=list(c(), names))
info.mat <- as.data.frame(info.mat, stringsAsFactors=F)
for (i in 1:length(column)){
info <- unlist(strsplit(column[i], "\\;"))
for (e in info){
e <- unlist(strsplit(e, "\\="))
j <- which(names == e[1])
if (length(e) > 1){
# KEY=VALUE. The value might contain a = as well
value <- paste(e[2:length(e)], collapse='=')
info.mat[i,j] <- value
}else{
# only KEY
info.mat[i,j] <- 'YES'
}
}
}
return(info.mat)
}
And then I call:
mat <- unlist.info(names, vector)
Even though this works, it is really slow. Also I am handling vectors with over 100.000 entries. Now I realize that looping is inelegant and inefficient in R and I am familiar with the concept of applying functions to data frames. However, since every entry of the vector contains a different subset of KEY=VALUE
or KEY
entries I could not come up with a more efficient function.
Here you go:
Recreate the data:
x <- c(
"HGVSc=ENST00000495576.1:n.820-1G>A;INTRON=1//1;CANONICAL=YES",
"DISTANCE=2179",
"HGVSc=ENST00000466430.1:n.911C>T;EXON=4//4;CANONICAL=YES",
"DISTANCE=27;CANONICAL=YES;common"
)
Create a named vector with your desired names. This is used for fast lookup later:
names <- setNames(1:15, c('ENSP','HGVS','DOMAINS','EXON','INTRON', 'HGVSp', 'HGVSc','CANONICAL','GMAF','DISTANCE', 'HGNC', 'CCDS', 'SIFT', 'PolyPhen', 'common'))
Create a helper function that assigns each variable to the correct position in a matrix. Then use lapply
and strsplit
:
assign <- function(x, names){
xx <- sapply(x, function(i)if(length(i)==2L) i else c(i, "YES"))
z <- rep(NA, length(names))
z[names[xx[1, ]]] <- xx[2, ]
z
}
sx <- lapply(strsplit(x, ";"), strsplit, "=")
ret <- t(sapply(sx, assign, names))
colnames(ret) <- names(names)
ret
The results:
ENSP HGVS DOMAINS EXON INTRON HGVSp HGVSc CANONICAL GMAF DISTANCE HGNC
[1,] NA NA NA NA "1//1" NA "ENST00000495576.1:n.820-1G>A" "YES" NA NA NA
[2,] NA NA NA NA NA NA NA NA NA "2179" NA
[3,] NA NA NA "4//4" NA NA "ENST00000466430.1:n.911C>T" "YES" NA NA NA
[4,] NA NA NA NA NA NA NA "YES" NA "27" NA
CCDS SIFT PolyPhen common
[1,] NA NA NA NA
[2,] NA NA NA NA
[3,] NA NA NA NA
[4,] NA NA NA "YES"
Here's another, faster, solution taking advantage of the original pairings...
## test elapsed replications relative average
## 2 thell_solution(x) 0.37 1000 1.000 0.00037
## 3 andrie_solution(x) 1.04 1000 2.811 0.00104
## 1 original_solution(x) 2.61 1000 7.054 0.00261
Since pairing[1] always gets assigned pairing[2] except with the final bool (… not that I understand why that one flag is treated differently in the original string vector …) we can take advantage of the sequence and the fact that the vector will assign NA when a name is given without a value ( ie: x[5] == NA ) and we also have no need to call names multiple times. And since strsplit uses regex we can do alternation.
# Let `x` be as @Andrie made it in his answer. Let `names` be as you had
# in the original question.
# A pre-built dummy record and empty list.
na.record <- setNames(rep(NA, time = length(names)), names)
y <- list()
do.call(rbind, lapply(strsplit(x, "(;|=)"), FUN = function(x) {
x_seq <- seq.int(to = length(x), by = 2)
y[x[x_seq]] <- x[x_seq + 1]
y[is.na(y)] <- "YES"
na.record[x[x_seq]] <- y
na.record
}))
## ENSP HGVS DOMAINS EXON INTRON HGVSp HGVSc
## [1,] NA NA NA NA "1//1" NA "ENST00000495576.1:n.820-1G>A"
## [2,] NA NA NA NA NA NA NA
## [3,] NA NA NA "4//4" NA NA "ENST00000466430.1:n.911C>T"
## [4,] NA NA NA NA NA NA NA
## CANONICAL GMAF DISTANCE HGNC CCDS SIFT PolyPhen common
## [1,] "YES" NA NA NA NA NA NA NA
## [2,] NA NA "2179" NA NA NA NA NA
## [3,] "YES" NA NA NA NA NA NA NA
## [4,] "YES" NA "27" NA NA NA NA "YES"
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With