
Count the number of valid observations (no NA) pairwise in a data frame

Say I have a data frame like this:

Df <- data.frame(
    V1 = c(1,2,3,NA,5),
    V2 = c(1,2,NA,4,5),
    V3 = c(NA,2,NA,4,NA)
)

Now I want to count the number of valid observations for every combination of two variables. For that, I wrote a function sharedcount:

sharedcount <- function(x,...){
    nx <- names(x)
    alln <- combn(nx,2)
    out <- apply(alln,2,
      function(y)sum(complete.cases(x[y]))
    )
    data.frame(t(alln),out)
}

This gives the output:

> sharedcount(Df)
  X1 X2 out
1 V1 V2   3
2 V1 V3   1
3 V2 V3   2

This works, but the function takes quite long on big data frames (600 variables and about 10000 observations). I have the feeling I'm overlooking an easier approach, especially since cor(..., use='pairwise') still runs a whole lot faster even though it has to do something similar:

> require(rbenchmark)    
> benchmark(sharedcount(TestDf),cor(TestDf,use='pairwise'),
+     columns=c('test','elapsed','relative'),
+     replications=1
+ )
                           test elapsed relative
2 cor(TestDf, use = "pairwise")    0.25     1.0
1           sharedcount(TestDf)    1.90     7.6
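TestDf itself is not shown in the question. To reproduce a comparison like the one above, a stand-in could be generated along these lines. This is only a sketch: the helper name make_test_df, its default dimensions, and the 10% share of missing values are assumptions for illustration, not the data behind the timings above.

```r
# Hypothetical helper (not from the question): builds a numeric data frame
# with a fixed share of cells set to NA at random positions.
make_test_df <- function(n_obs = 1000, n_vars = 50, na_frac = 0.1, seed = 1) {
    set.seed(seed)
    m <- matrix(rnorm(n_obs * n_vars), nrow = n_obs, ncol = n_vars)
    # sample() without replacement, so exactly round(na_frac * length(m)) cells become NA
    m[sample(length(m), size = round(na_frac * length(m)))] <- NA
    as.data.frame(m)
}

TestDf <- make_test_df()
```

Raising n_obs and n_vars toward the sizes mentioned above (10000 and 600) makes the difference between the approaches more visible, at the cost of a longer run.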

Any tips are appreciated.


Note: Using Vincent's trick, I wrote a function that returns the same data frame. The code is in my answer below.

Joris Meys asked Feb 23 '12 12:02



2 Answers

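The following is slightly faster than sharedcount. It marks the non-missing cells and takes the cross-product, which gives the number of shared valid observations for every pair of columns (the diagonal holds the per-column counts). The timings in the comments compare it with cor: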

x <- !is.na(Df)
t(x) %*% x

#       test elapsed relative
#    cor(Df)  12.345 1.000000
# t(x) %*% x  20.736 1.679708
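The matrix above holds every pairwise count twice (it is symmetric). A minimal sketch of how it could be turned back into the three-column layout from the question follows. The name sharedcount2 is hypothetical, and this is not necessarily the function the asker ended up writing. It relies on two base R features: crossprod(x), which computes t(x) %*% x, and indexing a matrix with a two-column character matrix of row and column names.

```r
# Hypothetical helper: pairwise counts of shared non-NA observations,
# returned in the same layout as sharedcount() from the question.
sharedcount2 <- function(x) {
    valid <- !is.na(x)                     # logical matrix, TRUE where observed
    counts <- crossprod(valid)             # same result as t(valid) %*% valid
    pairs <- t(combn(colnames(valid), 2))  # one row per pair of column names
    # Indexing by a 2-column character matrix looks up counts[row name, column name]
    data.frame(pairs, out = counts[pairs])
}

Df <- data.frame(
    V1 = c(1, 2, 3, NA, 5),
    V2 = c(1, 2, NA, 4, 5),
    V3 = c(NA, 2, NA, 4, NA)
)
res <- sharedcount2(Df)
```

Indexing by name pairs keeps the rows in the same order as combn, so the result lines up with the output of the original function.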
Vincent Zoonekynd answered Sep 21 '22 10:09



I thought Vincent's approach looked really elegant, not to mention faster than my sophomoric for-loop, but it needs an extraction step to pull the pairwise counts out of the matrix, which I added in the benchmark below. The gap between the for-loop and sharedcount is just an example of the heavy overhead of apply when it is used with data frames.

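shrcnt <- function(Df) {
    # One row per pair of column indices
    Comb <- t(combn(seq_len(ncol(Df)), 2))
    shrd <- integer(nrow(Comb))
    # seq_along(), not seq_len(): seq_len() expects a single length, not a vector
    for (i in seq_along(shrd)) {
        shrd[i] <- sum(complete.cases(Df[, Comb[i, 1]], Df[, Comb[i, 2]]))
    }
    shrd
}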

x <- !is.na(Df)   # validity matrix from Vincent's answer
benchmark(
    shrcnt(Df), sharedcount(Df), {prs <- t(x) %*% x; prs[lower.tri(prs)]},
    cor(Df, use='pairwise'),
    columns=c('test','elapsed','relative'),
    replications=100
)
#--------------
                       test elapsed relative
3                         {   0.008      1.0
4 cor(Df, use = "pairwise")   0.020      2.5
2           sharedcount(Df)   0.092     11.5
1                shrcnt(Df)   0.036      4.5
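The extraction step uses lower.tri rather than upper.tri for a reason that is easy to miss: R reads a matrix column by column, so the lower triangle comes out in the same pair order as combn, while the upper triangle does not once there are more than three columns. A small sketch with a made-up four-column data frame (Df4 is an assumption for illustration, not data from the question):

```r
# Illustrative data only: four columns so that the two orderings differ
Df4 <- data.frame(
    V1 = c(1, NA, 3),
    V2 = c(1, 2, NA),
    V3 = c(NA, 2, 3),
    V4 = c(1, 2, 3)
)
x4 <- !is.na(Df4)
prs4 <- t(x4) %*% x4

# Reference order: pairs as produced by combn, looked up one by one
idx <- t(combn(ncol(prs4), 2))   # 2-column integer matrix of (i, j) pairs
by_combn <- prs4[idx]

low <- prs4[lower.tri(prs4)]     # column-major walk of the lower triangle
up  <- prs4[upper.tri(prs4)]     # column-major walk of the upper triangle

all(low == by_combn)             # lower triangle follows the combn order
all(up == by_combn)              # upper triangle does not for this data
```

With only three columns, as in the original Df, both triangles happen to give the same order, so the difference shows up only on wider data.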
IRTFM answered Sep 21 '22 10:09
