
Getting column means for non-zero data

Tags:

r

Can R get the colMeans for the non-zero values of a data frame?

data<-data.frame(col1=c(1,0,1,0,3,3),col2=c(5,0,5,0,7,7))
colMeans(data)   # 1.33,4

I would like something like:

mean(data$col1[data$col1>0]) # 2
mean(data$col2[data$col2>0]) # 6

Thanks in advance :D


Benchmarks of Solutions:

n <- 2E4
m <- 1E3
data <- matrix(runif(n*m), nrow = n)   # note: a matrix here, not a data frame

system.time(col_means <- colSums(data)/colSums(!!data))
#   user  system elapsed 
# 0.087   0.007   0.094 

system.time(colMeans(NA^(data==0)*data, na.rm=TRUE))
#   user  system elapsed 
#  0.167   0.084   0.251 

# vapply over a matrix visits every element rather than every column,
# so this timing is not a like-for-like comparison
system.time(vapply(data, function(x) mean(x[x!=0]), numeric(1)))
#   user  system elapsed 
#126.519   0.737 127.715 

library(dplyr)
system.time(summarise_each(data, funs(mean(.[.!=0])))) # Gave error (likely because data is a matrix here, not a data frame)
HowYaDoing asked Aug 03 '15 15:08


2 Answers

You can use colSums on both the data and its "logical representation" (!!data, which is TRUE wherever a value is non-zero), then divide the column sums by the number of non-zero elements in each column:

colSums(data)/colSums(!!data)
col1 col2 
   2    6 
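
To see why this works, here is a minimal sketch that splits the one-liner into its parts, using the example data from the question. The intermediate names (nonzero, counts, sums, z) are only illustrative and are not part of the original answer.

```r
# Example data from the question
data <- data.frame(col1 = c(1, 0, 1, 0, 3, 3), col2 = c(5, 0, 5, 0, 7, 7))

# !data is TRUE where a value is zero; negating again gives TRUE where it is non-zero
nonzero <- !!data              # logical matrix, same truth values as data != 0
counts  <- colSums(nonzero)    # number of non-zero entries per column
sums    <- colSums(data)       # zeros contribute nothing to the sum

result <- sums / counts        # col1 = 2, col2 = 6

# Caveat: a column with no non-zero entries gives 0/0, which is NaN
z <- data.frame(a = c(0, 0), b = c(2, 4))
caveat <- colSums(z) / colSums(!!z)   # a = NaN, b = 3
```

Note that this approach assumes the data contains no NA values; an NA in a column would make both colSums results NA for that column.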
James answered Nov 15 '22 08:11


You could change the 0s to NA and then use colMeans, which has an na.rm=TRUE option. This is a two-step process: first convert the data elements that are 0 to NA, then get the colMeans excluding the NA elements.

  is.na(data) <- data==0
  colMeans(data, na.rm=TRUE) 
  #   col1 col2 
  #    2    6 

If you need that in a single step, raise NA to the power of the logical matrix (data==0). Since NA^TRUE is NA and NA^FALSE is 1, zeros map to NA and non-zero elements map to 1. Multiplying by the original data then turns each 1 back into the element at that position, while NA stays NA. We can do colMeans on that output as above.

   colMeans(NA^(data==0)*data, na.rm=TRUE)
   #  col1 col2 
   #   2    6 
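
The NA^ trick is easier to follow with the intermediate values spelled out. The sketch below works on a matrix copy of the question's data to keep every step a plain matrix operation; the names m, mask and masked are illustrative only.

```r
# The two identities the trick relies on
p0 <- NA^0   # 1  (anything to the power 0 is 1, even NA)
p1 <- NA^1   # NA

# Example data from the question, as a numeric matrix
data <- data.frame(col1 = c(1, 0, 1, 0, 3, 3), col2 = c(5, 0, 5, 0, 7, 7))
m <- as.matrix(data)

mask   <- NA^(m == 0)   # NA where m is zero, 1 elsewhere
masked <- mask * m      # zeros become NA, other values are unchanged

result <- colMeans(masked, na.rm = TRUE)   # col1 = 2, col2 = 6
```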

Another option is using sapply/vapply. If the dataset is really big, converting to a matrix may not be a good idea, as it may cause memory issues. By looping through the columns with either sapply or the more specific vapply (which should be slightly faster), we get the mean of the non-zero elements of each column.

 vapply(data, function(x) mean(x[x!=0]), numeric(1))
 #  col1 col2 
 #  2    6 
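
One thing to watch with this approach: vapply loops over columns only when its input is a data frame (a list of columns). On a matrix it visits every single element instead. The sketch below illustrates the difference on the question's data; nonzero_mean is a hypothetical helper name, and apply with margin 2 is shown as one possible column-wise alternative for matrices.

```r
data <- data.frame(col1 = c(1, 0, 1, 0, 3, 3), col2 = c(5, 0, 5, 0, 7, 7))

# Hypothetical helper: mean of the non-zero elements of a vector
nonzero_mean <- function(x) mean(x[x != 0])

# Data frame input: one call per column, so one result per column
by_col <- vapply(data, nonzero_mean, numeric(1))

# Matrix input: one call per element, so 12 results here
# (zero elements give NaN because they leave nothing to average)
m <- as.matrix(data)
by_elem <- vapply(m, nonzero_mean, numeric(1))

# For a matrix, apply() over margin 2 loops over columns
by_col_m <- apply(m, 2, nonzero_mean)
```

On a large matrix the element-wise version makes one function call per cell, which is far more work than one call per column.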

Or we can use summarise_each from dplyr, specifying the function inside funs after subsetting the non-zero elements.

 library(dplyr)
 summarise_each(data, funs(mean(.[.!=0])))
 #  col1 col2
 #1    2    6
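
summarise_each and funs have since been deprecated in dplyr. As a rough equivalent, assuming dplyr 1.0.0 or later is installed, the same summary can be sketched with summarise and across. This is a substitute for the call above rather than part of the original answer, and it needs a data frame (or tibble) as input, not a matrix.

```r
library(dplyr)

data <- data.frame(col1 = c(1, 0, 1, 0, 3, 3), col2 = c(5, 0, 5, 0, 7, 7))

# Apply the non-zero mean to every column; .x stands for the current column
res <- summarise(data, across(everything(), ~ mean(.x[.x != 0])))
# res is a one-row data frame with col1 = 2 and col2 = 6
```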
akrun answered Nov 15 '22 08:11