I have a data frame that looks like this:
Store Temperature Unemployment Sum_Sales
1 1 42.31 8.106 1643691
2 1 38.51 8.106 1641957
3 1 39.93 8.106 1611968
4 1 46.63 8.106 1409728
5 1 46.50 8.106 1554807
6 1 57.79 8.106 1439542
For each 'Store', I want to normalize/scale two columns ("Sum_Sales" and "Temperature").
Desired output:
Store Temperature Unemployment Sum_Sales
1 1 1.000 8.106 1.00000
2 1 0.000 8.106 0.94533
3 1 0.374 8.106 0.00000
4 2 0.012 8.106 0.00000
5 2 0.000 8.106 1.00000
6 2 1.000 8.106 0.20550
Here is the normalizing function that I created:
normalit<-function(m){
(m - min(m))/(max(m)-min(m))
}
What I have tried:
df2 <- df %.%
group_by('Store') %.%
summarise(Temperature = normalit(Temperature), Sum_Sales = normalit(Sum_Sales)))
Any suggestions/help would be greatly appreciated. Thanks.
In scaling, you're changing the range of your data, while in normalization you're changing the shape of the distribution of your data.
Feature scaling is a method used to normalize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step.
The issue is that you are using the wrong dplyr verb. summarise will create one result per group per variable. What you want is mutate, which changes variables and returns a result of the same length as the original. See http://cran.rstudio.com/web/packages/dplyr/vignettes/dplyr.html. Below are two approaches using dplyr:
df %>%
group_by(Store) %>%
mutate(Temperature = normalit(Temperature), Sum_Sales = normalit(Sum_Sales))
df %>%
group_by(Store) %>%
mutate_each(funs(normalit), Temperature, Sum_Sales)
Note: The Store variable is different between your data and desired result. I assumed that @jlhoward got the right data.
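In more recent dplyr releases (1.0.0 and later, as far as I know), mutate_each() and funs() are deprecated in favor of across(). Here is a minimal sketch of the same grouped rescaling, assuming dplyr >= 1.0.0 and the two-store version of the sample data; df2 is just an illustrative name:

```r
library(dplyr)

# Min-max rescaling function from the question
normalit <- function(m) {
  (m - min(m)) / (max(m) - min(m))
}

# Two-store version of the sample data (the first field of each row is a row name)
df <- read.table(header = TRUE, text = "Store Temperature Unemployment Sum_Sales
1 1 42.31 8.106 1643691
2 1 38.51 8.106 1641957
3 1 39.93 8.106 1611968
4 2 46.63 8.106 1409728
5 2 46.50 8.106 1554807
6 2 57.79 8.106 1439542")

df2 <- df %>%
  group_by(Store) %>%
  # apply normalit to both columns, separately within each Store
  mutate(across(c(Temperature, Sum_Sales), normalit)) %>%
  ungroup()
```

Unemployment is left untouched because it is not listed inside across().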
Here's a data.table solution. I changed your example a bit to have two types of store.
df <- read.table(header=T,text="Store Temperature Unemployment Sum_Sales
1 1 42.31 8.106 1643691
2 1 38.51 8.106 1641957
3 1 39.93 8.106 1611968
4 2 46.63 8.106 1409728
5 2 46.50 8.106 1554807
6 2 57.79 8.106 1439542")
library(data.table)
DT <- as.data.table(df)
DT[,list(Temperature=normalit(Temperature),Sum_Sales=normalit(Sum_Sales)),
by=list(Store,Unemployment)]
# Store Unemployment Temperature Sum_Sales
# 1: 1 8.106 1.00000000 1.0000000
# 2: 1 8.106 0.00000000 0.9453393
# 3: 1 8.106 0.37368421 0.0000000
# 4: 2 8.106 0.01151461 0.0000000
# 5: 2 8.106 0.00000000 1.0000000
# 6: 2 8.106 1.00000000 0.2055018
Note that your normalization will have problems if there is only one row for a store: max(m) - min(m) is then 0, so the result is 0/0, i.e. NaN.
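One way to guard against the single-row case (or any group where all values are equal) is to check the range before dividing. This is only a sketch: normalit_safe is a hypothetical name, and returning 0 for a constant group is an arbitrary choice that you may want to replace with NA or 0.5 depending on your use.

```r
# Hypothetical variant of normalit that avoids 0/0 for constant groups
normalit_safe <- function(m) {
  rng <- max(m) - min(m)
  # Zero range: every value is identical, so return 0 (an assumption, not a rule)
  if (isTRUE(rng == 0)) {
    return(rep(0, length(m)))
  }
  (m - min(m)) / rng
}

normalit_safe(5)           # 0 instead of NaN
normalit_safe(c(1, 2, 3))  # 0.0 0.5 1.0
```

It can be dropped into the mutate() or data.table calls in place of normalit.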