Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

why does sd in R return a vector for matrix input, and what can I do about it?

Tags:

r

I am somewhat confused as to why the sd function in R returns an array for matrix input (I suppose to maintain backwards compatibility, it always will). This is very odd behaviour to me:

#3d input, same same
print(length(mean(array(rnorm(60),dim=c(3,4,5)))))
print(length(sd(array(rnorm(60),dim=c(3,4,5)))))
#1d input, same same
print(length(mean(array(rnorm(60),dim=c(60)))))
print(length(sd(array(rnorm(60),dim=c(60)))))
#2d input, different!
print(length(mean(array(rnorm(60),dim=c(12,5)))))
print(length(sd(array(rnorm(60),dim=c(12,5)))))

I get

[1] 1
[1] 1
[1] 1
[1] 1
[1] 1
[1] 5

That is sd behaves differently from mean when the input is a 2-d array (and apparently only in that case!) Consider then, this failed function to rescale each column of a k-dimensional array by the standard deviation:

re.scale <- function(x) {
    #rescale by the standard deviation of each column
    scales <- apply(x,2,sd)
    ret.val <- sweep(x,2,scales,"/")
}

#this works just fine
x <- array(rnorm(60),dim=c(12,5))
y <- re.scale(x)

#this throws a warning
x <- array(rnorm(60),dim=c(3,4,5))
y <- re.scale(x)

Is there some other function to replace sd without this weird behavior? How would one write re.scale properly? Or a Z-score-by-column function?

like image 961
shabbychef Avatar asked Aug 21 '11 03:08

shabbychef


2 Answers

It is behaving as document in sd's help page. At the very top it announces:

"If x is a matrix or a data frame, a vector of the standard deviation of the columns is returned."

Note it does not say that the arrays are included, so only arrays with two dimensions are included. If you want to stop this behavior, then just make a vector out of it with c():

 sd( c(array(rnorm(60),dim=c(12,5))) )
 # [1] 0.9505643

I see that you added a request for column z scores. Try this for matrices:

colMeans(x)/sd(x)

And this for arrays (although the definition of a "column" may need clarification:

apply(x, 2:3, mean)/apply(x, 2:3, sd)   # will generalize to higher dimensions
like image 140
IRTFM Avatar answered Nov 09 '22 00:11

IRTFM


The actions of sd were changed:

1. version 2.13.2(2011-09-30) and earlier

> set.seed(1)
> sd(array(rnorm(60),dim=c(12,5))) 
[1] 0.8107276 1.1234795 0.7925743 0.6186082 0.9464160

Description

This function computes the standard deviation of the values in x. If na.rm is TRUE then missing values are removed before computation proceeds. If x is a matrix or a data frame, a vector of the standard deviation of the columns is returned.


2. R version 2.14.0(2011-10-31) - 2.15.3(2013-03-01)

> set.seed(1)
> sd(array(rnorm(60),dim=c(12,5))) 
[1] 0.8107276 1.1234795 0.7925743 0.6186082 0.9464160
 WARNING:
sd(<matrix>) is deprecated.
 Use apply(*, 2, sd) instead. 

Details

Prior to R 2.14.0, sd(dfrm) worked directly for a data.frame dfrm. This is now deprecated and you are expected to use sapply(dfrm, sd).


3. R version 3.0.0 (2013-04-03) and later

> sd(array(rnorm(60),dim=c(12,5))) 
[1] 0.8551688
>
(no WARNIG)
like image 31
Yoshiyuki Umemura Avatar answered Nov 09 '22 01:11

Yoshiyuki Umemura