Understanding `scale` in R

I'm trying to understand the definition of scale that R provides. I have data (mydata) that I want to make a heat map with, and it has a VERY strong positive skew. I've created a heatmap with a dendrogram for both scale(mydata) and log(mydata), and the two dendrograms are different. Why? What does it mean to scale my data, versus log transform my data? And which would be more appropriate if I want to look at the dendrogram illustrating the relationship between the columns of my data?

Thank you for any help! I've read the definitions but they are whooshing over my head.

Jen asked Nov 28 '13 01:11



1 Answer

log simply takes the logarithm (base e, by default) of each element of the vector.
scale, with default settings, calculates the mean and standard deviation of the vector, then "scales" each element by subtracting the mean and dividing by the sd. When x is a matrix, this is done separately for each column. (If you use scale(x, scale=FALSE), it only subtracts the mean and does not divide by the standard deviation.)

Note that scaling manually and calling scale() give you the same values:

    set.seed(1)
    x <- runif(7)

    # Manually scaling
    (x - mean(x)) / sd(x)

    # Using scale()
    scale(x)
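As for why the two dendrograms can differ, here is a rough sketch rather than a definitive analysis. The matrix below is a made-up stand-in for mydata (random, strongly right-skewed values; the column names are hypothetical). It illustrates that scale is a linear, per-column transformation, so it leaves the skew and the correlations between columns unchanged, whereas log is nonlinear and compresses large values, so the distances that the clustering is built on change.

```r
set.seed(1)

# Hypothetical stand-in for mydata: 20 rows, 4 strongly right-skewed columns
mydata <- matrix(rlnorm(80, meanlog = 0, sdlog = 2), nrow = 20,
                 dimnames = list(NULL, paste0("col", 1:4)))

scaled <- scale(mydata)  # per column: subtract the mean, divide by the sd
logged <- log(mydata)    # element-wise natural log (needs positive values)

# After scale(), every column has mean 0 and sd 1
col_means <- colMeans(scaled)
col_sds   <- apply(scaled, 2, sd)

# scale() is linear, so correlations between columns are unchanged
same_cor_after_scale <- isTRUE(all.equal(cor(scaled), cor(mydata)))

# log() is nonlinear, so correlations (and distances) between columns change
same_cor_after_log <- isTRUE(all.equal(cor(logged), cor(mydata)))

# Cluster the columns: transpose so that columns become the rows dist() compares
hc_scaled <- hclust(dist(t(scaled)))
hc_logged <- hclust(dist(t(logged)))
```

Calling plot(hc_scaled) and plot(hc_logged) draws the two dendrograms side by side for comparison. With strongly skewed data, the few very large values still dominate the distances after scale, while log reduces their influence, which is one plausible reason the trees come out different.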
Ricardo Saporta answered Sep 22 '22 18:09