Can R get the colMeans for the non-zero values of a data frame?
data<-data.frame(col1=c(1,0,1,0,3,3),col2=c(5,0,5,0,7,7))
colMeans(data) # 1.33,4
I would like something like:
mean(data$col1[data$col1>0]) # 2
mean(data$col2[data$col2>0]) # 6
Thanks in advance :D
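n <- 2E4
m <- 1E3
# Note: data is a matrix here, not a data frame
data <- matrix(runif(n*m), nrow = n)
system.time(col_means <- colSums(data)/colSums(!!data))
# user system elapsed
# 0.087 0.007 0.094
system.time(colMeans(NA^(data==0)*data, na.rm=TRUE))
# user system elapsed
# 0.167 0.084 0.251
# On a matrix, vapply() visits every element rather than every column,
# which is why this call is so slow
system.time(vapply(data, function(x) mean(x[x!=0]), numeric(1)))
# user system elapsed
#126.519 0.737 127.715
library(dplyr)
# Gave error, most likely because summarise_each() expects a data frame, not a matrix
system.time(summarise_each(data, funs(mean(.[.!=0]))))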
You can use colSums on both the data and its "logical representation" to divide the column sums by the number of non-zero elements for each column:
colSums(data)/colSums(!!data)
# col1 col2
# 2 6
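To see why this works, here is a small sketch on the question's data. It uses data != 0 in place of !!data (the two give the same logical matrix for numeric data), and the names nonzero and counts are only illustrative. Note that a column with no non-zero entries would give 0/0, i.e. NaN.

```r
data <- data.frame(col1 = c(1, 0, 1, 0, 3, 3), col2 = c(5, 0, 5, 0, 7, 7))

# TRUE marks a non-zero entry, FALSE marks a zero.
nonzero <- data != 0

# Summing TRUE/FALSE counts the non-zero entries in each column.
counts <- colSums(nonzero)

# Column sum divided by that count is the mean of the non-zero entries.
nonzero_means <- colSums(data) / counts
print(nonzero_means)

# Edge case: an all-zero column divides 0 by 0 and yields NaN.
all_zero <- data.frame(z = c(0, 0, 0))
print(colSums(all_zero) / colSums(all_zero != 0))
```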
You could change the 0 to NA and then use colMeans, as it has an na.rm=TRUE option. In a two-step process, we convert the data elements that are 0 to NA, and then get the colMeans excluding the NA elements.
is.na(data) <- data==0
colMeans(data, na.rm=TRUE)
# col1 col2
# 2 6
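Keep in mind that is.na(data) <- data==0 overwrites the zeros in data itself. If the original values are still needed afterwards, a sketch along these lines does the same thing on a copy (tmp is just an illustrative name):

```r
data <- data.frame(col1 = c(1, 0, 1, 0, 3, 3), col2 = c(5, 0, 5, 0, 7, 7))

# Work on a copy so the zeros in `data` survive.
tmp <- data
tmp[tmp == 0] <- NA

# NA entries are skipped, so only the non-zero values are averaged.
res <- colMeans(tmp, na.rm = TRUE)
print(res)

# The original data frame still holds its four zeros.
print(sum(data == 0))
```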
If you need that in a single step, we can turn the logical matrix (data==0) into a matrix of NA and 1 with NA^(data==0): zeros map to NA (since NA^TRUE is NA) and non-zero elements map to 1 (since NA^FALSE is 1). Multiplying by the original data then restores the non-zero values and leaves NA in place of the zeros. We can then apply colMeans to that output as above.
colMeans(NA^(data==0)*data, na.rm=TRUE)
# col1 col2
# 2 6
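The NA^ trick relies on R's rule that anything raised to the power 0 is 1, even NA. A minimal sketch of the intermediate steps, converting to a matrix first for clarity (m and mask are illustrative names, not part of the answer above):

```r
data <- data.frame(col1 = c(1, 0, 1, 0, 3, 3), col2 = c(5, 0, 5, 0, 7, 7))
m <- as.matrix(data)

# FALSE is treated as 0 and TRUE as 1 in arithmetic:
# NA^0 is 1, NA^1 is NA.
mask <- NA^(m == 0)   # NA where m is zero, 1 everywhere else

# Multiplying keeps non-zero values (1 * x) and propagates NA for zeros.
masked <- mask * m

res <- colMeans(masked, na.rm = TRUE)
print(res)
```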
Another option is using sapply/vapply. If the dataset is really big, converting to a matrix may not be a good idea, as it may cause memory issues. By looping through the columns with either sapply or the stricter vapply (which should be a bit faster), we get the mean of the non-zero elements.
vapply(data, function(x) mean(x[x!=0]), numeric(1))
# col1 col2
# 2 6
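One caveat: vapply loops over columns only when data is a data frame (a list of columns). On a matrix it loops over every single element, which is what happens in the timing code in this thread. As a rough sketch, the column-wise equivalent for a matrix would be apply with MARGIN = 2 (nz_mean is a hypothetical helper name):

```r
# Mean of the non-zero entries of a numeric vector.
nz_mean <- function(x) mean(x[x != 0])

data <- data.frame(col1 = c(1, 0, 1, 0, 3, 3), col2 = c(5, 0, 5, 0, 7, 7))

# Data frame: vapply() iterates over the columns.
by_df <- vapply(data, nz_mean, numeric(1))

# Matrix: apply() with MARGIN = 2 iterates over the columns.
m <- as.matrix(data)
by_mat <- apply(m, 2, nz_mean)

print(by_df)
print(by_mat)

# vapply() on the matrix returns one value per element, not per column.
print(length(vapply(m, nz_mean, numeric(1))))
```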
Or we can use summarise_each and specify the function inside funs, after subsetting the non-zero elements. (Both summarise_each and funs are deprecated in recent dplyr releases in favour of across.)
library(dplyr)
summarise_each(data, funs(mean(.[.!=0])))
# col1 col2
#1 2 6