scale/normalize columns by group

Tags: r, scale, dplyr, plyr

I have a data frame that looks like this:

  Store Temperature Unemployment Sum_Sales
1     1       42.31        8.106   1643691
2     1       38.51        8.106   1641957
3     1       39.93        8.106   1611968
4     1       46.63        8.106   1409728
5     1       46.50        8.106   1554807
6     1       57.79        8.106   1439542

For each 'Store', I want to normalize/scale two columns ("Sum_Sales" and "Temperature").

Desired output:

  Store Temperature Unemployment Sum_Sales
1     1       1.000        8.106   1.00000
2     1       0.000        8.106   0.94533
3     1       0.374        8.106   0.00000
4     2       0.012        8.106   0.00000
5     2       0.000        8.106   1.00000
6     2       1.000        8.106   0.20550

Here is the normalizing function that I created:

 normalit<-function(m){
   (m - min(m))/(max(m)-min(m))
 }

What I have tried:

df2 <- df %.%
  group_by('Store') %.%
  summarise(Temperature = normalit(Temperature), Sum_Sales = normalit(Sum_Sales)))

Any suggestions/help would be greatly appreciated. Thanks.

asked Nov 15 '14 at 19:11 by itjcms18

2 Answers

The issue is that you are using the wrong dplyr verb. summarise() collapses each group to one result per variable. What you want is mutate(), which changes variables and returns a result of the same length as the original. Note also that group_by() takes the bare column name (Store, not 'Store'), and that the %.% operator has been replaced by %>%. See http://cran.rstudio.com/web/packages/dplyr/vignettes/dplyr.html. Below are two approaches using dplyr:

df %>%
    group_by(Store) %>%
    mutate(Temperature = normalit(Temperature), Sum_Sales = normalit(Sum_Sales))

# mutate_each() and funs() are deprecated in current dplyr; across() is the replacement
df %>%
    group_by(Store) %>%
    mutate(across(c(Temperature, Sum_Sales), normalit))

Note: The Store variable is different between your data and desired result. I assumed that @jlhoward got the right data.
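
For comparison, the same "one value per input row, computed within each group" behaviour can be sketched in base R with ave(), with no extra packages. This is only an illustrative sketch: the two-store data frame below is an assumption (it follows the desired output rather than the data as posted), and the name `scaled` is made up for the example.

```r
# Min-max scaling function from the question.
normalit <- function(m) {
  (m - min(m)) / (max(m) - min(m))
}

# Assumed two-store version of the example data (the posted data only has Store 1).
df <- data.frame(
  Store = c(1, 1, 1, 2, 2, 2),
  Temperature = c(42.31, 38.51, 39.93, 46.63, 46.50, 57.79),
  Unemployment = 8.106,
  Sum_Sales = c(1643691, 1641957, 1611968, 1409728, 1554807, 1439542)
)

# ave() applies FUN within each Store and returns a vector as long as its input,
# which is the same shape guarantee that mutate() gives after group_by().
scaled <- df
scaled$Temperature <- ave(df$Temperature, df$Store, FUN = normalit)
scaled$Sum_Sales <- ave(df$Sum_Sales, df$Store, FUN = normalit)

print(scaled)
```

Each store's Temperature and Sum_Sales then run from 0 to 1 independently, and Unemployment is left untouched.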

answered Oct 18 '22 at 20:10 by Vincent

Here's a data.table solution. I changed your example a bit to have two types of store.

df <- read.table(header=T,text="Store Temperature Unemployment Sum_Sales
1     1       42.31        8.106   1643691
2     1       38.51        8.106   1641957
3     1       39.93        8.106   1611968
4     2       46.63        8.106   1409728
5     2       46.50        8.106   1554807
6     2       57.79        8.106   1439542")

library(data.table)
DT <- as.data.table(df)
DT[,list(Temperature=normalit(Temperature),Sum_Sales=normalit(Sum_Sales)),
    by=list(Store,Unemployment)]
#    Store Unemployment Temperature Sum_Sales
# 1:     1        8.106  1.00000000 1.0000000
# 2:     1        8.106  0.00000000 0.9453393
# 3:     1        8.106  0.37368421 0.0000000
# 4:     2        8.106  0.01151461 0.0000000
# 5:     2        8.106  0.00000000 1.0000000
# 6:     2        8.106  1.00000000 0.2055018

Note that your normalization will have problems if there is only 1 row for a store: max(m) then equals min(m), so the function computes 0/0 and returns NaN.
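
One possible way to guard against that case is sketched below. The function name `normalit_safe` is hypothetical, and returning 0 for a constant group is an assumption; NA or some other value may suit your analysis better. The sketch also assumes the column has no missing values.

```r
# Hypothetical guarded variant of normalit(); the name and the fallback of 0 are assumptions.
normalit_safe <- function(m) {
  rng <- max(m) - min(m)
  if (rng == 0) {
    # All values are identical (this includes a single-row group): avoid computing 0/0.
    return(rep(0, length(m)))
  }
  (m - min(m)) / rng
}

normalit_safe(c(5))        # 0
normalit_safe(c(2, 4, 6))  # 0.0 0.5 1.0
```

It can be dropped in wherever normalit is used in the grouped calls above.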

answered Oct 18 '22 at 21:10 by jlhoward