I have a question regarding the colMeans function. Is there a version of it that will not return an error when it runs into a column of length one? For example:
temp<-cbind(c(2,2),c(3,4))
colMeans(temp)
[1] 2.0 3.5
But for this one
temp2<-c(2,2)
colMeans(temp2)
Error in colMeans(temp2) :
'x' must be an array of at least two dimensions
But, if I apply the function mean to each column it properly comes up with the value of 2 and 2.
I wrote a function to do this
testfun<-function(i,x){
mean(x[,i])
}
sapply(seq_len(ncol(temp)), testfun, temp)
which gives the same results as colMeans.
I've heard that colMeans is supposed to be much faster than this method. So, is there a version of colMeans that will work when my column is of size 1?
As @Paul points out, colMeans expects "an array of two or more dimensions" for its x argument (from ?colMeans). But temp2 is not an array:
is.array(temp2)
# [1] FALSE
temp2 can be made into an array:
(tempArray <- array(temp2, dim = c(1, 2)))
# [,1] [,2]
# [1,] 2 2
colMeans(tempArray)
# [1] 2 2
Perhaps temp2 came from subsetting an array, such as
array(temp2, dim = c(2, 2))[1, ]
But this is not an array. To keep it as an array, add drop = FALSE inside the brackets:
array(temp2, dim = c(2, 2))[1, , drop = FALSE]
# [,1] [,2]
# [1,] 2 2
Then you can use colMeans on the subsetted array.
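If the input is sometimes a matrix and sometimes a plain vector, one option is a small wrapper that restores the missing dimension before calling colMeans. The following is only a minimal sketch: col_means_safe is a hypothetical name, not a base R function, and it assumes a bare vector should be read as a single row (one value per column), which matches the result of 2 and 2 expected in the question.

```r
# Hypothetical helper (not part of base R): colMeans() that also accepts a
# plain vector by treating it as a single row, i.e. one value per column.
col_means_safe <- function(x, ...) {
  if (is.null(dim(x))) {
    x <- matrix(x, nrow = 1)
  }
  colMeans(x, ...)
}

temp  <- cbind(c(2, 2), c(3, 4))
temp2 <- c(2, 2)

col_means_safe(temp)       # same as colMeans(temp)
col_means_safe(temp2)      # one mean per element of the vector
col_means_safe(temp[1, ])  # works even though the subset dropped to a vector
```

If a vector should instead be read as a single column, swap nrow = 1 for ncol = 1 in the helper.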
The colMeans function is meant for n-dimensional arrays. When your column is of size 1 (1 column, or 1 row?) you effectively have a vector. On a vector, using just mean is fine. In terms of speed, calculating the mean of a million numbers is very fast:
> system.time(mean(runif(10e5)))
user system elapsed
0.038 0.000 0.038
@PaulHiemstra and @BenBarnes provide correct answers. I just want to add to their explanations.
Vectors vs. arrays
Vectors are the fundamental data structure in R. Almost everything is internally represented as a vector, even lists (with the exception of a special kind of list, the dotted pair list, see ?list). Arrays are simply vectors with an attribute attached, the dim attribute, which describes the object's dimensions. Consider the following:
v <- c(1:10)
a <- array(v, dim = c(5, 2))
length(v) # 10
length(a) # 10
attributes(v) # NULL
attributes(a) # $dim 5 2
is.vector(v) # TRUE
is.array(v) # FALSE
is.vector(a) # FALSE
is.array(a) # TRUE
Both v and a are length 10. The only difference is a has the dim attribute attached to it. Because of this added attribute, R treats a externally as an array instead of a vector. Modifying just the dim attribute can change R's external representation of an object from array to vector and back:
attr(a, "dim") <- NULL
is.vector(a) # TRUE
is.array(a) # FALSE
attr(v, "dim") <- c(5, 2)
is.vector(v) # FALSE
is.array(v) # TRUE
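The same switch can be written with the dim<- replacement function, which is the more common spelling of attr(x, "dim") <- value. As a small sketch (the variable names here are only for illustration), it also shows that the underlying values are untouched and are filled column by column:

```r
v <- 1:10
a <- v
dim(a) <- c(5, 2)          # same effect as attr(a, "dim") <- c(5, 2)

is.array(a)                # TRUE
identical(as.vector(a), v) # TRUE: values and their order are unchanged
a[1, 2]                    # 6: the second column starts at the sixth element
```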
In your example, temp2 is a vector object, thus lacking a dim attribute. colMeans is expecting an array object with a dim attribute of at least length 2 (two dimensional). You can easily convert temp2 to a two dimensional array with a single column:
temp3 <- array(temp2, dim = c(length(temp2), 1))
# or:
temp4 <- temp2
attr(temp4, "dim") <- c(length(temp2), 1)
is.array(temp2) # FALSE
is.array(temp3) # TRUE
is.array(temp4) # TRUE
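Which dimension you add matters for what colMeans returns. A short sketch, using matrix() rather than array() purely for brevity (one_col and one_row are illustrative names): a single column gives one overall mean, while a single row gives one mean per element, which is the result the question was after.

```r
temp2 <- c(2, 2)

one_col <- matrix(temp2, ncol = 1)  # 2 rows, 1 column
one_row <- matrix(temp2, nrow = 1)  # 1 row, 2 columns

dim(one_col)      # 2 1
dim(one_row)      # 1 2

colMeans(one_col) # 2   (one mean for the single column)
colMeans(one_row) # 2 2 (one mean per column)
```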
colMeans() vs. mean()
@PaulHiemstra is right: instead of converting a vector to a single column for colMeans(), it is much more common to just use mean() on a vector. However, you are correct that colMeans() is faster. I believe this is because it does a bit less checking for well-formed data, but we'd have to look at the internal C code to be sure. Consider this example:
# Create vector "v" and array "a"
n <- 10e7
set.seed(123) # Set random number seed to ensure "v" and "a[,1]" are equal
v <- runif(n)
set.seed(123) # Set random number seed to ensure "v" and "a[,1]" are equal
a <- array(runif(n), dim=c(n, 1))
# Test that "v" and "a[,1]" are equal
all.equal(v, a[,1]) # TRUE
# Functions to compare
f1 <- function(x = v){mean(x)} # Using mean on vector
f2 <- function(x = a){mean(x)} # Using mean on array
f3 <- function(x = a){colMeans(x)} # Using colMeans on array
# Compare elapsed time
system.time(f1()) # elapsed time = 0.344
system.time(f2()) # elapsed time = 0.366
system.time(f3()) # elapsed time = 0.166
colMeans() on the array is faster than mean() on either a vector or an array. However, most of the time this speed-up will be negligible. I find that it is more natural to just use mean() on a vector or single-column array. But, if you are a true speed demon you might sleep better at night knowing that you are saving several hundred milliseconds of processing time by using colMeans() on single column arrays instead.
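Single system.time() calls can be noisy, so if you want to repeat the comparison on your own machine, something like the sketch below may give a steadier picture. It uses a smaller n so it runs quickly, and time_median is a hypothetical helper written for this example, not an existing function. Actual timings will vary by machine and R version.

```r
set.seed(123)
n <- 1e6                      # smaller than the n above so it runs quickly
v <- runif(n)
a <- array(v, dim = c(n, 1))  # same values as a single-column array

# The two functions agree up to floating-point tolerance
all.equal(mean(v), colMeans(a))

# Hypothetical helper: median elapsed time over several repetitions
time_median <- function(f, reps = 20) {
  median(replicate(reps, system.time(f())[["elapsed"]]))
}

time_median(function() mean(v))
time_median(function() colMeans(a))
```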