Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Standardize data columns in R

I have to assume you meant to say that you wanted a mean of 0 and a standard deviation of 1. If your data is in a dataframe and all the columns are numeric you can simply call the scale function on the data to do what you want.

dat <- data.frame(x = rnorm(10, 30, .2), y = runif(10, 3, 5))
scaled.dat <- scale(dat)

# check that we get mean of 0 and sd of 1
colMeans(scaled.dat)  # faster version of apply(scaled.dat, 2, mean)
apply(scaled.dat, 2, sd)

Using built in functions is classy. Like this cat:

enter image description here


Realizing that the question is old and one answer is accepted, I'll provide another answer for reference.

scale is limited by the fact that it scales all variables. The solution below allows to scale only specific variable names while preserving other variables unchanged (and the variable names could be dynamically generated):

library(dplyr)

set.seed(1234)
dat <- data.frame(x = rnorm(10, 30, .2), 
                  y = runif(10, 3, 5),
                  z = runif(10, 10, 20))
dat

dat2 <- dat %>% mutate_at(c("y", "z"), ~(scale(.) %>% as.vector))
dat2

which gives me this:

> dat
          x        y        z
1  29.75859 3.633225 14.56091
2  30.05549 3.605387 12.65187
3  30.21689 3.318092 13.04672
4  29.53086 3.079992 15.07307
5  30.08582 3.437599 11.81096
6  30.10121 4.621197 17.59671
7  29.88505 4.051395 12.01248
8  29.89067 4.829316 12.58810
9  29.88711 4.662690 19.92150
10 29.82199 3.091541 18.07352

and

> dat2 <- dat %>% mutate_at(c("y", "z"), ~(scale(.) %>% as.vector))
> dat2
          x          y           z
1  29.75859 -0.3004815 -0.06016029
2  30.05549 -0.3423437 -0.72529604
3  30.21689 -0.7743696 -0.58772361
4  29.53086 -1.1324181  0.11828039
5  30.08582 -0.5946582 -1.01827752
6  30.10121  1.1852038  0.99754666
7  29.88505  0.3283513 -0.94806607
8  29.89067  1.4981677 -0.74751378
9  29.88711  1.2475998  1.80753470
10 29.82199 -1.1150515  1.16367556

EDIT 1 (2016): Addressed Julian's comment: the output of scale is Nx1 matrix so ideally we should add an as.vector to convert the matrix type back into a vector type. Thanks Julian!

EDIT 2 (2019): Quoting Duccio A.'s comment: For the latest dplyr (version 0.8) you need to change dplyr::funcs with list, like dat %>% mutate_each_(list(~scale(.) %>% as.vector), vars=c("y","z"))

EDIT 3 (2020): Thanks to @mj_whales: the old solution is deprecated and now we need to use mutate_at.


This is 3 years old. Still, I feel I have to add the following:

The most common normalization is the z-transformation, where you subtract the mean and divide by the standard deviation of your variable. The result will have mean=0 and sd=1.

For that, you don't need any package.

zVar <- (myVar - mean(myVar)) / sd(myVar)

That's it.


'Caret' package provides methods for preprocessing data (e.g. centering and scaling). You could also use the following code:

library(caret)
# Assuming goal class is column 10
preObj <- preProcess(data[, -10], method=c("center", "scale"))
newData <- predict(preObj, data[, -10])

More details: http://www.inside-r.org/node/86978


When I used the solution stated by Dason, instead of getting a data frame as a result, I got a vector of numbers (the scaled values of my df).

In case someone is having the same trouble, you have to add as.data.frame() to the code, like this:

df.scaled <- as.data.frame(scale(df))

I hope this is will be useful for ppl having the same issue!


You can easily normalize the data also using data.Normalization function in clusterSim package. It provides different method of data normalization.

    data.Normalization (x,type="n0",normalization="column")

Arguments

x
vector, matrix or dataset type
type of normalization: n0 - without normalization

n1 - standardization ((x-mean)/sd)

n2 - positional standardization ((x-median)/mad)

n3 - unitization ((x-mean)/range)

n3a - positional unitization ((x-median)/range)

n4 - unitization with zero minimum ((x-min)/range)

n5 - normalization in range <-1,1> ((x-mean)/max(abs(x-mean)))

n5a - positional normalization in range <-1,1> ((x-median)/max(abs(x-median)))

n6 - quotient transformation (x/sd)

n6a - positional quotient transformation (x/mad)

n7 - quotient transformation (x/range)

n8 - quotient transformation (x/max)

n9 - quotient transformation (x/mean)

n9a - positional quotient transformation (x/median)

n10 - quotient transformation (x/sum)

n11 - quotient transformation (x/sqrt(SSQ))

n12 - normalization ((x-mean)/sqrt(sum((x-mean)^2)))

n12a - positional normalization ((x-median)/sqrt(sum((x-median)^2)))

n13 - normalization with zero being the central point ((x-midrange)/(range/2))

normalization
"column" - normalization by variable, "row" - normalization by object