Why does scale return NaN for zero variance columns?

Tags:

r

Consider the following matrix:

x <- matrix(c(1,1,1,3),2)
x
     [,1] [,2]
[1,]    1    1
[2,]    1    3

When calling scale with this, NaN values are returned for the first column, which has zero variance:

scale(x)
     [,1]       [,2]
[1,]  NaN -0.7071068
[2,]  NaN  0.7071068
attr(,"scaled:center")
[1] 1 2
attr(,"scaled:scale")
[1] 0.000000 1.414214

However, I would expect it to return 0. Is this a bug, or am I misunderstanding what scale is supposed to return?

The workaround for what I want is:

y <- scale(x)
y[is.nan(y)] <- 0

But this involves an extra variable. Is there a more elegant solution?

James asked Nov 30 '22 21:11


2 Answers

You could use the following workaround:

apply(x, 2, function(y) (y - mean(y)) / sd(y) ^ as.logical(sd(y)))

     [,1]       [,2]
[1,]    0 -0.7071068
[2,]    0  0.7071068
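
The exponent trick above relies on two R behaviours: 0^0 evaluates to 1, and a logical used in arithmetic is coerced to 0 or 1. A minimal sketch of just that part follows; the names zero_pow, safe_divisor and divisors are hypothetical and only serve the illustration.

```r
# In R, any number raised to the power 0 is 1, and that includes 0^0.
zero_pow <- 0^0

# as.logical() maps 0 to FALSE and any non-zero number to TRUE.
# Used as an exponent, FALSE acts as 0 and TRUE acts as 1, so:
#   s == 0  ->  0^0 = 1   (dividing by 1 leaves the centered zeros as 0)
#   s != 0  ->  s^1 = s   (the usual division by the standard deviation)
safe_divisor <- function(s) s^as.logical(s)

# Divisors for a zero-variance column and for a column with sd = sqrt(2).
divisors <- c(safe_divisor(0), safe_divisor(sqrt(2)))
```

So the zero-variance column is divided by 1 rather than by 0, which is why it comes out as 0 instead of NaN.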
Sven Hohenstein answered Dec 22 '22 00:12


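Because scale divides each centered column by its standard deviation, it must do this: for a zero-variance column both the centered values and the standard deviation are 0, and 0/0 is NaN.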
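
To make the 0/0 step concrete, here is a small sketch using the matrix from the question. It also shows one possible way to get zeros without post-processing the result, by passing explicit divisors to scale. The helper name scale_safe is hypothetical and not part of base R, and the sketch assumes the input has no missing values.

```r
x <- matrix(c(1, 1, 1, 3), 2)

# Centering the constant column gives all zeros, and its standard deviation
# is also 0, so the division is 0 / 0, which is NaN.
centered <- x[, 1] - mean(x[, 1])
ratio <- centered / sd(x[, 1])

# Hypothetical helper: compute the column standard deviations, replace any
# zero with 1, and hand the result to scale() as explicit divisors.
scale_safe <- function(m) {
  s <- apply(m, 2, sd)
  s[s == 0] <- 1
  scale(m, center = TRUE, scale = s)
}

y <- scale_safe(x)
```

With this approach the "scaled:scale" attribute records 1 for the constant column instead of 0, which may or may not be what downstream code expects.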

Continuous variables really aren't supposed to have ties, much less zero variance, and it is not appropriate to scale a discrete or categorical variable.

Matthew Lundberg answered Dec 21 '22 22:12