I want to do simple computations by group. As usual, I used aggregate. To compute the sum of var by the groups gp1, gp2, and gp3, I did:
m.temp <- aggregate(var ~ gp1 + gp2 + gp3, df, sum)
It works well, but it was very slow. Before switching to data.table, I tried changing the syntax of the call to speed it up:
m.temp2 <- aggregate(df$var,
                     list(df$gp1, df$gp2, df$gp3),
                     sum)
Unfortunately for me, a simple verification showed me that these computations are not equivalent.
> identical(m.temp, m.temp2)
[1] FALSE
The variable names are different, but worse, the two results differ by 19 477 observations (rows), and it is not because of NAs...
Hence my first question: why? What is the difference between these two syntaxes?
To find out which syntax is correct, I tried the same computation with data.table. Unfortunately I couldn't get any result because my syntax is wrong, and I don't understand what I missed. I tried:
m.temp4 <- df[, list(sum = sum(df$var)),
              by = list(gp1, gp2, gp3)]
Finally, I also tried to add the aggregate directly as a new column, again without any result:
df[, new.col := sum(var), by = list(gp1, gp2, gp3)]
What did I do wrong?
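On the first question, one documented difference between the two aggregate interfaces is how they treat missing values: the formula method defaults to na.action = na.omit and drops incomplete rows before grouping, while the x/by form keeps rows whose value is NA. The post says NAs are not the cause, and nothing here can confirm what produced the 19 477-row gap, so the following is only a sketch of that one difference on made-up data (toy, g and the values are hypothetical). Note also that the x/by form names its columns Group.1, ..., x by default, so identical() returns FALSE even when the numbers agree.

```r
# Toy data (hypothetical): group "b" contains only a missing value in var
toy <- data.frame(g = c("a", "a", "b"), var = c(1, 2, NA))

# Formula interface: the default na.action = na.omit removes the NA row
# before grouping, so group "b" never reaches sum()
by_formula <- aggregate(var ~ g, toy, sum)

# x/by interface: the NA row is kept, so group "b" appears with an NA sum
by_list <- aggregate(toy$var, list(g = toy$g), sum)

nrow(by_formula)  # 1
nrow(by_list)     # 2
```

If this turns out to be relevant, passing na.action = na.pass to the formula call, or na.rm = TRUE to sum in either call, is one way to check whether the row counts converge.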
This assumes df is a data.table; if it is not, convert it with setDT:
library(data.table)
setDT(df)[, new_col := sum(var), by = .(gp1, gp2, gp3)]
In the OP's post, sum was applied to the whole column df$var instead of the 'var' values within each group, so every group ends up with the same overall total. Remove the df$ and use the unquoted column name.
NOTE: := adds a new column to the existing table. If the intention is to summarise, place the expression in list() or .() instead:
setDT(df)[, .(new_col = sum(var)), by = .(gp1, gp2, gp3)]
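To make the difference concrete, here is a minimal sketch on made-up data (toy and its values are hypothetical, and a single grouping column stands in for gp1, gp2, gp3). It compares the df$-style reference with the bare column name, and the summarising form with the := form.

```r
library(data.table)

# Toy data (hypothetical): two rows in group "a", one in group "b"
toy <- data.table(gp1 = c("a", "a", "b"), var = c(1, 2, 3))

# Referencing the table by name inside j sums the full column in every group
whole <- toy[, .(total = sum(toy$var)), by = gp1]

# The bare column name sums only the rows of the current group
grouped <- toy[, .(total = sum(var)), by = gp1]

# := keeps every row and adds the group sum as a new column
# (copy() is used here only so that toy itself is left unchanged)
added <- copy(toy)[, new_col := sum(var), by = gp1]

whole$total    # 6 6
grouped$total  # 3 3
nrow(added)    # 3
```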
Another option is the tidyverse:
library(tidyverse)
df %>%
  group_by(gp1, gp2, gp3) %>%
  summarise(new_col = sum(var))
To create a new column instead, replace summarise with mutate.
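As a rough illustration of how the two verbs differ, the sketch below runs both on made-up data (toy and its values are hypothetical, with one grouping column for brevity). It loads dplyr alone rather than the whole tidyverse, and the .groups argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Toy data (hypothetical): two rows in group "a", one in group "b"
toy <- data.frame(gp1 = c("a", "a", "b"), var = c(1, 2, 3))

# summarise collapses each group to a single row
summarised <- toy %>%
  group_by(gp1) %>%
  summarise(new_col = sum(var), .groups = "drop")

# mutate keeps every row and repeats the group sum on each of them
mutated <- toy %>%
  group_by(gp1) %>%
  mutate(new_col = sum(var)) %>%
  ungroup()

nrow(summarised)  # 2
nrow(mutated)     # 3
```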