Using dplyr to summarize and keep the same variable name

Tags:

I have found that data.table and dplyr have differing results when trying to do the same thing. I would like to use dplyr syntax, but have it compute in the way that data.table does. The use case is that I want to add subtotals to a table. To do that, I need to do some aggregation to each variable, but then keep the same variable names (in the transformed version). Data.table allows me to perform some aggregation on a variable and keep the same name. Then do another aggregation with that same variable. It will continue to use the untransformed version. Dplyr, however, will use the transformed version.

In the summarize documentation it says:

# Note that with data frames, newly created summaries immediately
# overwrite existing variables
mtcars %>%
  group_by(cyl) %>%
  summarise(disp = mean(disp), sd = sd(disp))

This is basically the issue I am running into, but I'm wondering if there is a nice workaround. One thing I found was to just name the transformed variable something else then rename it at the end, but that does not look very nice to me. If there is a nice way to do subtotals, that'd be good to know as well. I looked around this site and did not see this exact situation discussed. Any help would be greatly appreciated!

Here I have made a simple example, once with data.table's results, and once with dplyr's. I want to take this simple table and append a subtotal row that is the weighted average of the column of interest (Total).

library(data.table)
library(dplyr)

dt <- data.table(Group = LETTERS[1:5],
                 Count = c(1000, 1500, 1200, 2000, 5000),
                 Total = c(50, 300, 600, 400, 1000))
dt[, Count_Dist := Count/sum(Count)]
dt[, .(Count_Dist = sum(Count_Dist), Weighted_Total = sum(Count_Dist*Total))]

dt <- rbind(dt[, .(Group, Count_Dist, Total)],
      dt[, .(Group = "All", Count_Dist = sum(Count_Dist), Total = sum(Count_Dist*Total))])
setnames(dt, "Total", "Weighted_Avg_Total")

dt

df <- data.frame(Group = LETTERS[1:5],
                 Count = c(1000, 1500, 1200, 2000, 5000),
                 Total = c(50, 300, 600, 400, 1000))

df %>%
  mutate(Count_Dist = Count/sum(Count)) %>%
  summarize(Count_Dist = sum(Count_Dist),
            Weighted_Total = sum(Count_Dist*Total))

df %>% 
  mutate(Count_Dist = Count/sum(Count)) %>%
  select(Group, Count_Dist, Total) %>% 
  rbind(df %>%
          mutate(Count_Dist = Count/sum(Count)) %>%
          summarize(Group = "All",
                    Count_Dist = sum(Count_Dist),
                    Total = sum(Count_Dist*Total))) %>% 
  rename(Weighted_Avg_Total = Total)

Thanks again for any help!

487

asked Jan 20 '18 15:01

Hutch3232

1 Answers

A possible solution is to skip the mutate steps and use transmute for the first mutate/select-step and directly calculate the desired variables from the original variables without creating an intermediate variable for the second mutate-step:

df %>% 
  transmute(Group, Count_Dist = Count/sum(Count), Weighted_Avg_Total = Total) %>% 
  bind_rows(df %>%
              summarize(Group = "All",
                        Count_Dist = sum(Count/sum(Count)),
                        Weighted_Avg_Total = sum((Count/sum(Count))*Total)))

which gives:

  Group Count_Dist Weighted_Avg_Total
1     A 0.09345794            50.0000
2     B 0.14018692           300.0000
3     C 0.11214953           600.0000
4     D 0.18691589           400.0000
5     E 0.46728972          1000.0000
6   All 1.00000000           656.0748

Another possible solution is to alter the order in which the new variables are calculated in dplyr and then use select to get the column-order back into what you originally wanted:

df %>% 
  mutate(Count_Dist = Count/sum(Count)) %>%
  select(Group, Count_Dist, Weighted_Avg_Total = Total) %>% 
  bind_rows(df %>%
              mutate(Count_Dist = Count/sum(Count)) %>%
              summarize(Group = "All",
                        Weighted_Avg_Total = sum(Count_Dist*Total),
                        Count_Dist = sum(Count_Dist)) %>% 
              select(Group, Count_Dist, Weighted_Avg_Total))

If you want to include the Count-column as well, you could do (based on my comment from below):

df %>% 
  transmute(Group = Group, Count_Dist = Count/sum(Count), Weighted_Avg_Total = Total, Count) %>% 
  bind_rows(df %>%
              summarize(Group = "All",
                        Count_Dist = sum(Count/sum(Count)),
                        Weighted_Avg_Total = sum((Count/sum(Count))*Total),
                        Count = sum(Count)))

answered Sep 19 '22 13:09

Jaap

Related questions
                            
                                Merge weights of same model trained on 2 different computers using tensorflow
                            
                                Payload contains two or more files with the same destination path
                            
                                How can I unmute (and turn off autoplay) a YouTube video for twentyseventeen?
                            
                                GCC inline assembly: "g" constraint and parameter size
                            
                                How to execute multiple artisan commands without waiting for one to end execution?
                            
                                rendering first step of multistep form wizard as partial in another controller's show action
                            
                                Difference between getExternalStorageDirectory() and getDataDirectory()
                            
                                How to match template friend function in a template class
                            
                                What is the best DynamoDB architecture for storing several document types?
                            
                                Bootstrap: Control multiple tab panels with single tab nav
                            
                                Pygame Error "No available video device" on online 'repl.it'
                            
                                How to handle missing JSON data when creating objects? (Typescript)

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With