
How to normalise subgroups from a grouped data frame in R

Tags: r, dplyr

I have a data frame with two numerical variables, fatcontent and saltcontent, plus two factor variables, cond and spice, that describe the different treatments. In this data frame, each measurement of the numerical variables was taken twice.

a <- data.frame(cond = rep(c("uncooked", "fried", "steamed", "baked", "grilled"),
                       each = 2, times = 3),
                spice = rep(c("none", "chilli", "basil"), each = 10),
                fatcontent = c(4, 5, 6828, 7530, 6910, 7132, 5885, 613, 2845, 2867,
                               25, 18, 2385, 33227, 4233, 4023, 953, 1025, 4465, 5016,
                               5, 5, 10235, 12545, 5511, 5111, 596, 585, 4012, 3633),
                saltcontent = c(2, 5, 4733, 5500, 5724, 15885, 14885, 217, 193, 148,
                                6, 4, 26738, 24738, 22738, 23738, 267, 256, 1121, 1558,
                                1, 1, 21738, 20738, 26738, 27738, 195, 202, 129, 131)
                )

Now, I wish to normalise the numerical variables for each spice group, which in this case means dividing them by the mean of that group's uncooked condition.
E.g. for a$spice == "none":

       cond  spice fatcontent saltcontent  
1  uncooked   none          4           2  
2  uncooked   none          5           5  
3     fried   none       6828        4733  
4     fried   none       7530        5500  
5   steamed   none       6910        5724  
6   steamed   none       7132       15885  
7     baked   none       5885       14885  
8     baked   none        613         217  
9   grilled   none       2845         193  
10  grilled   none       2867         148   

After normalisation:

       cond spice   fatcontent  saltcontent
1  uncooked  none    0.8888889    0.5714286
2  uncooked  none    1.1111111    1.4285714
3     fried  none 1517.3333333 1352.2857143
4     fried  none 1673.3333333 1571.4285714
5   steamed  none 1535.5555556 1635.4285714
6   steamed  none 1584.8888889 4538.5714286
7     baked  none 1307.7777778 4252.8571429
8     baked  none  136.2222222   62.0000000
9   grilled  none  632.2222222   55.1428571
10  grilled  none  637.1111111   42.2857143
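
As a cross-check on the expected values above, the arithmetic for a single group can be sketched in base R. The vectors are copied from the spice == "none" rows of the question; the names fat_ref, salt_ref, fat_norm and salt_norm are introduced here purely for illustration.

```r
# Values for the spice == "none" group, copied from the question.
cond        <- rep(c("uncooked", "fried", "steamed", "baked", "grilled"), each = 2)
fatcontent  <- c(4, 5, 6828, 7530, 6910, 7132, 5885, 613, 2845, 2867)
saltcontent <- c(2, 5, 4733, 5500, 5724, 15885, 14885, 217, 193, 148)

# Reference values: the mean of the two "uncooked" measurements.
fat_ref  <- mean(fatcontent[cond == "uncooked"])   # (4 + 5) / 2 = 4.5
salt_ref <- mean(saltcontent[cond == "uncooked"])  # (2 + 5) / 2 = 3.5

# Normalise every measurement in the group by its reference value.
fat_norm  <- fatcontent / fat_ref
salt_norm <- saltcontent / salt_ref

print(data.frame(cond, fat_norm, salt_norm))
```

The first fried row, for example, is 6828 / 4.5 = 1517.333 for fatcontent and 4733 / 3.5 = 1352.286 for saltcontent, matching the table above. The question then amounts to repeating this calculation for every spice group.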

My question is: how can I do this for all the groups and variables in the data frame? I assume I could use the dplyr package, but I am not sure what the best way is. I appreciate any help!

karnowski asked Dec 12 '14 01:12


2 Answers

A succinct way to normalise the data is to select the "uncooked" rows right inside the mean calculation, so you don't need to filter, summarise, join and recalculate. mutate_each applies the same expression to every non-grouping column except cond, so you only need to type it once.

group_by(a, spice) %>%
  mutate_each(funs(./mean(.[cond == "uncooked"])), -cond)

#Source: local data frame [30 x 4]
#Groups: spice
#
#       cond  spice   fatcontent  saltcontent
#1  uncooked   none    0.8888889 5.714286e-01
#2  uncooked   none    1.1111111 1.428571e+00
#3     fried   none 1517.3333333 1.352286e+03
#4     fried   none 1673.3333333 1.571429e+03
#5   steamed   none 1535.5555556 1.635429e+03
#6   steamed   none 1584.8888889 4.538571e+03
#7     baked   none 1307.7777778 4.252857e+03
#8     baked   none  136.2222222 6.200000e+01
#9   grilled   none  632.2222222 5.514286e+01
#10  grilled   none  637.1111111 4.228571e+01
# ... etc
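
mutate_each() and funs() have since been deprecated in dplyr, so with a current release the same idea can be sketched with across(). This is a minimal sketch assuming dplyr 1.0.0 or later; the result name a_norm is introduced here for illustration, and the data frame is repeated from the question so the block runs on its own.

```r
library(dplyr)

# Data frame from the question, repeated so the example is self-contained.
a <- data.frame(cond = rep(c("uncooked", "fried", "steamed", "baked", "grilled"),
                           each = 2, times = 3),
                spice = rep(c("none", "chilli", "basil"), each = 10),
                fatcontent = c(4, 5, 6828, 7530, 6910, 7132, 5885, 613, 2845, 2867,
                               25, 18, 2385, 33227, 4233, 4023, 953, 1025, 4465, 5016,
                               5, 5, 10235, 12545, 5511, 5111, 596, 585, 4012, 3633),
                saltcontent = c(2, 5, 4733, 5500, 5724, 15885, 14885, 217, 193, 148,
                                6, 4, 26738, 24738, 22738, 23738, 267, 256, 1121, 1558,
                                1, 1, 21738, 20738, 26738, 27738, 195, 202, 129, 131))

# Within each spice group, divide each numeric column by the mean of
# its own "uncooked" rows. `cond` is looked up in the current group.
a_norm <- a %>%
  group_by(spice) %>%
  mutate(across(c(fatcontent, saltcontent),
                ~ .x / mean(.x[cond == "uncooked"]))) %>%
  ungroup()

print(a_norm, n = 10)
```

If there are many numeric columns, across(where(is.numeric), ...) should select them without listing each name. Grouping columns such as spice are excluded from across() automatically.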
talat answered Sep 20 '22 04:09


I think this is what you are after. You want the mean for each spice group, computed from the uncooked data points only. That is what the first step does, storing the result in ana. I then used left_join to add fatmean and saltmean from ana to your data frame, a. If your data is really large, this may not be a memory-efficient approach. Next, I did the division in mutate for each spice group. Finally, I dropped the two helper columns with select to tidy up the result.

### Find mean for each spice condition using uncooked data points                
ana <- group_by(filter(a, cond == "uncooked"), spice) %>%
       summarise(fatmean = mean(fatcontent), saltmean = mean(saltcontent)) 

 #   spice fatmean saltmean
 #1  basil     5.0      1.0
 #2 chilli    21.5      5.0
 #3   none     4.5      3.5

left_join(a, ana, by = "spice") %>%
    group_by(spice) %>%
    mutate(fatcontent = fatcontent / fatmean,
           saltcontent = saltcontent / saltmean) %>%
    select(-c(fatmean, saltmean))

# A part of the results
#       cond spice   fatcontent  saltcontent
#1  uncooked  none    0.8888889    0.5714286
#2  uncooked  none    1.1111111    1.4285714
#3     fried  none 1517.3333333 1352.2857143
#4     fried  none 1673.3333333 1571.4285714
#5   steamed  none 1535.5555556 1635.4285714
#6   steamed  none 1584.8888889 4538.5714286
#7     baked  none 1307.7777778 4252.8571429
#8     baked  none  136.2222222   62.0000000
#9   grilled  none  632.2222222   55.1428571
#10  grilled  none  637.1111111   42.2857143

If you want to do everything in a single pipeline, it would look something like this:

group_by(filter(a, cond == "uncooked"), spice) %>%
    summarise(fatmean = mean(fatcontent), saltmean = mean(saltcontent)) %>%
    left_join(a, ., by = "spice") %>% #right_join is possible with the dev dplyr
    group_by(spice) %>%
    mutate(fatcontent = fatcontent / fatmean,
           saltcontent = saltcontent / saltmean) %>%
    select(-c(fatmean, saltmean))
jazzurro answered Sep 22 '22 04:09