R - Per group computations - data.table and aggregate()

I want to do simple computations by group. As usual, I used aggregate(). To compute the sum of var by the groups gp1, gp2, and gp3, I did:

m.temp  <- aggregate(var ~ gp1 + gp2 + gp3, df, sum)

It works well, but it was very slow. Before doing it in data.table, I wanted to try changing the syntax of the call to speed it up. I then did:

m.temp2 <- aggregate(df$var, 
                     list(df$gp1, df$gp2, df$gp3), 
                     sum)

Unfortunately for me, a simple verification showed me that these computations are not equivalent.

> identical(m.temp, m.temp2)
[1] FALSE

Variable names are different but, worse, there is a difference of 19 477 observations (rows) between these 2 results, and it is not because of the presence of NAs...

Hence my first question: how come? What is the difference between these 2 syntaxes?

To find out which syntax is the right one, I tried to do it with a simple data.table call. Unfortunately, I couldn't get any result since my syntax is not correct, but I do not understand what I missed. I tried:

m.temp4 <- df[, list(sum = sum(df$var)), 
                      by = list(gp1, gp2, gp3)]

Finally, I also tried to directly add the aggregate as a new column, with the same absence of results...

df[, new.col := sum(var), by = list(gp1, gp2, gp3)] 

What did I do wrong?

TeYaP asked Jun 21 '26 01:06


1 Answer

Assuming the dataset is a data.table (if it is not, convert it by reference with setDT):

library(data.table)
setDT(df)[, new_col := sum(var), by = .(gp1, gp2, gp3)]

In the OP's post, sum was computed on the whole column df$var instead of on the 'var' elements inside each group, so every group received the same overall total. Remove the df$ and use the unquoted column name.

NOTE: The := creates a new column. If the intention is to summarise, place the expression in list() or .()
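To make the difference concrete, here is a minimal sketch on a hypothetical three-row table (the data and the names dt, wrong, and right are invented for illustration; only the column names follow the question):

```r
library(data.table)

# Hypothetical toy data, not the OP's dataset
dt <- data.table(gp1 = c("a", "a", "b"), var = c(1, 2, 10))

# dt$var refers to the full column, so the grouping is ignored
# and every group gets the grand total
wrong <- dt[, .(total = sum(dt$var)), by = gp1]

# The unquoted name is evaluated inside each group
right <- dt[, .(total = sum(var)), by = gp1]
```

On this toy data, wrong$total is 13 for both groups, while right$total is 3 for "a" and 10 for "b".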

setDT(df)[, .(new_col =  sum(var)), by = .(gp1, gp2, gp3)]
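A small sketch of how the two forms differ in shape, again on hypothetical toy data (dt and summarised are illustrative names). Note also that := returns its result invisibly, so nothing is printed at the console after it; that may be part of what looked like an absence of results in the question.

```r
library(data.table)

# Hypothetical toy data, not the OP's dataset
dt <- data.table(gp1 = c("a", "a", "b"), var = c(1, 2, 10))

# .() returns a new table with one row per group
summarised <- dt[, .(new_col = sum(var)), by = gp1]

# := adds the column to dt by reference and keeps every original row;
# nothing is printed, so inspect dt afterwards
dt[, new_col := sum(var), by = gp1]
print(dt)
```

Here summarised has 2 rows, whereas dt still has its 3 rows, each carrying the sum of its own group.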

Another option is the tidyverse:

library(tidyverse)
df %>%
    group_by(gp1, gp2, gp3) %>%
    summarise(new_col = sum(var))

To create a new column instead, replace summarise with mutate.
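On the first question (why the two aggregate() calls return different numbers of rows), one possible cause, which cannot be confirmed without the OP's data, is missing-value handling: the formula interface defaults to na.action = na.omit and drops rows containing NA before grouping, while the non-formula call keeps them. A sketch on hypothetical data:

```r
# Hypothetical toy data: group "b" has only missing values in var
df <- data.frame(gp1 = c("a", "a", "b", "b"),
                 var = c(1, 2, NA, NA))

# Formula interface: rows with NA are removed first, so group "b" vanishes
by_formula <- aggregate(var ~ gp1, df, sum)

# Non-formula interface: group "b" is kept and its sum is NA
by_default <- aggregate(df$var, list(df$gp1), sum)

# Asking the formula interface to keep NA rows restores group "b"
by_formula_pass <- aggregate(var ~ gp1, df, sum, na.action = na.pass)
```

On this toy data, by_formula has one row and by_default has two, so comparing nrow() on the real results and checking anyNA(df) would be a reasonable first diagnostic.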

akrun answered Jun 23 '26 12:06