I have found that data.table and dplyr have differing results when trying to do the same thing. I would like to use dplyr syntax, but have it compute in the way that data.table does. The use case is that I want to add subtotals to a table. To do that, I need to do some aggregation to each variable, but then keep the same variable names (in the transformed version). Data.table allows me to perform some aggregation on a variable and keep the same name. Then do another aggregation with that same variable. It will continue to use the untransformed version. Dplyr, however, will use the transformed version.
In the summarize documentation it says:
# Note that with data frames, newly created summaries immediately
# overwrite existing variables
mtcars %>%
group_by(cyl) %>%
summarise(disp = mean(disp), sd = sd(disp))
This is basically the issue I am running into, but I'm wondering if there is a nice workaround. One thing I found was to just name the transformed variable something else then rename it at the end, but that does not look very nice to me. If there is a nice way to do subtotals, that'd be good to know as well. I looked around this site and did not see this exact situation discussed. Any help would be greatly appreciated!
Here I have made a simple example, once with data.table's results, and once with dplyr's. I want to take this simple table and append a subtotal row that is the weighted average of the column of interest (Total).
library(data.table)
library(dplyr)
dt <- data.table(Group = LETTERS[1:5],
Count = c(1000, 1500, 1200, 2000, 5000),
Total = c(50, 300, 600, 400, 1000))
dt[, Count_Dist := Count/sum(Count)]
dt[, .(Count_Dist = sum(Count_Dist), Weighted_Total = sum(Count_Dist*Total))]
dt <- rbind(dt[, .(Group, Count_Dist, Total)],
dt[, .(Group = "All", Count_Dist = sum(Count_Dist), Total = sum(Count_Dist*Total))])
setnames(dt, "Total", "Weighted_Avg_Total")
dt
df <- data.frame(Group = LETTERS[1:5],
Count = c(1000, 1500, 1200, 2000, 5000),
Total = c(50, 300, 600, 400, 1000))
df %>%
mutate(Count_Dist = Count/sum(Count)) %>%
summarize(Count_Dist = sum(Count_Dist),
Weighted_Total = sum(Count_Dist*Total))
df %>%
mutate(Count_Dist = Count/sum(Count)) %>%
select(Group, Count_Dist, Total) %>%
rbind(df %>%
mutate(Count_Dist = Count/sum(Count)) %>%
summarize(Group = "All",
Count_Dist = sum(Count_Dist),
Total = sum(Count_Dist*Total))) %>%
rename(Weighted_Avg_Total = Total)
Thanks again for any help!
summarise() creates a new data frame. It will have one (or more) rows for each combination of grouping variables; if there are no grouping variables, the output will have a single row summarising all observations in the input.
Summarize Function in R Programming. As its name implies, the summarize function reduces a data frame to a summary of just one vector or value. Many times, these summaries are calculated by grouping observations using a factor or categorical variables first.
summary statistic is computed using summary() function in R. summary() function is automatically applied to each column. The format of the result depends on the data type of the column. If the column is a numeric variable, mean, median, min, max and quartiles are returned.
A possible solution is to skip the mutate
steps and use transmute
for the first mutate
/select
-step and directly calculate the desired variables from the original variables without creating an intermediate variable for the second mutate
-step:
df %>%
transmute(Group, Count_Dist = Count/sum(Count), Weighted_Avg_Total = Total) %>%
bind_rows(df %>%
summarize(Group = "All",
Count_Dist = sum(Count/sum(Count)),
Weighted_Avg_Total = sum((Count/sum(Count))*Total)))
which gives:
Group Count_Dist Weighted_Avg_Total 1 A 0.09345794 50.0000 2 B 0.14018692 300.0000 3 C 0.11214953 600.0000 4 D 0.18691589 400.0000 5 E 0.46728972 1000.0000 6 All 1.00000000 656.0748
Another possible solution is to alter the order in which the new variables are calculated in dplyr
and then use select
to get the column-order back into what you originally wanted:
df %>%
mutate(Count_Dist = Count/sum(Count)) %>%
select(Group, Count_Dist, Weighted_Avg_Total = Total) %>%
bind_rows(df %>%
mutate(Count_Dist = Count/sum(Count)) %>%
summarize(Group = "All",
Weighted_Avg_Total = sum(Count_Dist*Total),
Count_Dist = sum(Count_Dist)) %>%
select(Group, Count_Dist, Weighted_Avg_Total))
If you want to include the Count
-column as well, you could do (based on my comment from below):
df %>%
transmute(Group = Group, Count_Dist = Count/sum(Count), Weighted_Avg_Total = Total, Count) %>%
bind_rows(df %>%
summarize(Group = "All",
Count_Dist = sum(Count/sum(Count)),
Weighted_Avg_Total = sum((Count/sum(Count))*Total),
Count = sum(Count)))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With