 

colMeans function in R and running into problems with columns of size 1

Tags:

r

I had a question regarding the colMeans function. Is there a version of it that will not return an error when it runs into a column of length one? For example:

temp<-cbind(c(2,2),c(3,4))
colMeans(temp)

[1] 2.0 3.5

But for this one

temp2<-c(2,2)
colMeans(temp2)

Error in colMeans(temp2) : 
'x' must be an array of at least two dimensions

But if I apply the function mean to each column, it correctly comes up with the values 2 and 2.

I wrote a function to do this

testfun <- function(i, x) {
  mean(x[, i])
}
sapply(1:ncol(temp), testfun, temp)

which gives the same results as colMeans.
I've heard that colMeans is supposed to be much faster than this method. So, is there a version of colMeans that will work when my column is of size 1?

asked May 16 '12 by doggysaywhat

3 Answers

As @Paul points out, colMeans expects "an array of two or more dimensions" for its x argument (from ?colMeans). But temp2 is not an array:

is.array(temp2)
# [1] FALSE

temp2 can be made into an array:

(tempArray <- array(temp2, dim = c(1, 2)))
#      [,1] [,2]
# [1,]    2    2

colMeans(tempArray)
# [1] 2 2

Perhaps temp2 came from subsetting an array, such as

array(temp2, dim = c(2, 2))[1, ]

But the result of that subsetting is not an array either. To keep it as an array, add drop = FALSE inside the brackets:

array(temp2, dim = c(2, 2))[1, , drop = FALSE]
#      [,1] [,2]
# [1,]    2    2

Then you can use colMeans on the subsetted array.
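
For completeness (this last step is an added illustration, not shown in the original answer), colMeans() on the drop = FALSE subset gives the per-element result the question was after:

colMeans(array(temp2, dim = c(2, 2))[1, , drop = FALSE])
# [1] 2 2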

answered Nov 10 '22 by BenBarnes

The colMeans function is meant for n-dimensional arrays. When your column is of size 1 (one column, or one row?) you effectively have a vector. On a vector, just using mean is fine. In terms of speed, calculating the mean of a million numbers is very fast:

> system.time(mean(runif(10e5)))
   user  system elapsed 
  0.038   0.000   0.038 
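
Putting the two answers together, here is a minimal sketch of a small dispatcher; the name col_means_or_mean is made up for illustration and is not part of base R. It calls mean() when the input has no dim attribute and colMeans() otherwise:

# Hypothetical helper: fall back to mean() for plain vectors
col_means_or_mean <- function(x) {
  if (is.null(dim(x))) {
    mean(x)      # plain vector: a single mean
  } else {
    colMeans(x)  # matrix/array: one mean per column
  }
}

col_means_or_mean(cbind(c(2, 2), c(3, 4)))  # 2.0 3.5
col_means_or_mean(c(2, 2))                  # 2

Note that this returns a single value for a bare vector; if you instead want each element treated as its own length-one column (the 2 2 from the question), first turn the vector into a one-row array as @BenBarnes shows.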
answered Nov 10 '22 by Paul Hiemstra


@PaulHiemstra and @BenBarnes provide correct answers. I just want to add to their explanations.

Vectors vs. arrays

Vectors are the fundamental data structure in R. Almost everything is internally represented as a vector, even lists (with the exception of a special kind of list, the dotted pair list, see ?list). Arrays are simply vectors with an attribute attached, the dim attribute, which describes the object's dimensions. Consider the following:

v <- c(1:10)
a <- array(v, dim = c(5, 2))
length(v) # 10
length(a) # 10
attributes(v) # NULL
attributes(a) # $dim 5 2
is.vector(v) # TRUE
is.array(v) # FALSE
is.vector(a) # FALSE
is.array(a) # TRUE

Both v and a are length 10. The only difference is that a has the dim attribute attached to it. Because of this added attribute, R treats a externally as an array instead of a vector. Modifying just the dim attribute can change R's external representation of an object from array to vector and back:

attr(a, "dim") <- NULL
is.vector(a) # TRUE
is.array(a) # FALSE
attr(v, "dim") <- c(5, 2)
is.vector(v) # FALSE
is.array(v) # TRUE
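
To tie this back to the question (an added illustration, not in the original answer): after the reassignment above, v carries a dim attribute and colMeans() accepts it, while a no longer does:

colMeans(v)
# [1] 3 8
colMeans(a)
# Error in colMeans(a) : 'x' must be an array of at least two dimensions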

In your example, temp2 is a vector object, thus lacking a dim attribute. colMeans expects an array object with a dim attribute of at least length 2 (two-dimensional). You can easily convert temp2 to a two-dimensional array with a single column:

temp3 <- array(temp2, dim = c(length(temp2), 1)) 
# or:
temp4 <- temp2
attr(temp4, "dim") <- c(length(temp2), 1)
is.array(temp2) # FALSE
is.array(temp3) # TRUE
is.array(temp4) # TRUE
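
For completeness (this output is an addition, not part of the original answer), colMeans() on the single-column array returns one value, the mean of the whole vector:

colMeans(temp3)
# [1] 2
colMeans(temp4)
# [1] 2

To get the per-element result 2 2 that the question expects, use a one-row array instead, as in @BenBarnes' answer.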

colMeans() vs. mean()

@PaulHiemstra is right: instead of converting a vector to a single-column array for colMeans(), it is much more common to just use mean() on a vector. However, you are correct that colMeans() is faster. I believe this is because it does a bit less checking for well-formed data, but we'd have to look at the internal C code to be sure. Consider this example:

# Create vector "v" and array "a"
n <- 10e7
set.seed(123) # Set random number seed to ensure "v" and "a[,1]" are equal
v <- runif(n)
set.seed(123) # Set random number seed to ensure "v" and "a[,1]" are equal
a <- array(runif(n), dim=c(n, 1))

# Test that "v" and "a[,1]" are equal
all.equal(v, a[,1]) # TRUE

# Functions to compare
f1 <- function(x = v){mean(x)} # Using mean on vector
f2 <- function(x = a){mean(x)} # Using mean on array
f3 <- function(x = a){colMeans(x)} # Using colMeans on array

# Compare elapsed time
system.time(f1()) # elapsed time = 0.344
system.time(f2()) # elapsed time = 0.366
system.time(f3()) # elapsed time = 0.166

colMeans() on the array is faster than mean() on either a vector or an array. However, most of the time this speed-up will be negligible. I find it more natural to just use mean() on a vector or single-column array. But if you are a true speed demon, you might sleep better at night knowing that you are saving several hundred milliseconds of processing time by using colMeans() on single-column arrays instead.
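
As a quick sanity check (an addition, not from the original answer), all three functions return the same value here, so the comparison above is purely about speed:

all.equal(f1(), f2())  # TRUE
all.equal(f1(), f3())  # TRUE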

answered Nov 10 '22 by jthetzel