Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Use data.table to count and aggregate / summarize a column

I want to count and aggregate(sum) a column in a data.table, and couldn't find the most efficient way to do this. This seems to be close to what I want R summarizing multiple columns with data.table.

My data:

set.seed(321) dat <- data.table(MNTH = c(rep(201501,4), rep(201502,3), rep(201503,5), rep(201504,4)),                    VAR = sample(c(0,1), 16, replace=T))  > dat      MNTH VAR  1: 201501   1  2: 201501   1  3: 201501   0  4: 201501   0  5: 201502   0  6: 201502   0  7: 201502   0  8: 201503   0  9: 201503   0 10: 201503   1 11: 201503   1 12: 201503   0 13: 201504   1 14: 201504   0 15: 201504   1 16: 201504   0 

I want to both count and sum VAR by MNTH using data.table. The desired result:

    MNTH COUNT VAR 1 201501     4   2 2 201502     3   0 3 201503     5   2 4 201504     4   2 
like image 468
Whitebeard Avatar asked Sep 28 '15 15:09

Whitebeard


People also ask

How do you aggregate a table in R?

In order to use the aggregate function for mean in R, you will need to specify the numerical variable on the first argument, the categorical (as a list) on the second and the function to be applied (in this case mean ) on the third. An alternative is to specify a formula of the form: numerical ~ categorical .

How do you count data in a table?

To counts all of the rows in a table, whether they contain NULL values or not, use COUNT(*). That form of the COUNT() function basically returns the number of rows in a result set returned by a SELECT statement.

What is .n in data table?

Think of .N as a variable for the number of instances. For example: dt <- data.table(a = LETTERS[c(1,1:3)], b = 4:7) dt[.N] # returns the last row # a b # 1: C 7.


1 Answers

The post you are referring to gives a method on how to apply one aggregation method to several columns. If you want to apply different aggregation methods to different columns, you can do:

dat[, .(count = .N, var = sum(VAR)), by = MNTH] 

this results in:

     MNTH count var 1: 201501     4   2 2: 201502     3   0 3: 201503     5   2 4: 201504     4   2 

You can also add these values to your existing dataset by updating your dataset by reference:

dat[, `:=` (count = .N, var = sum(VAR)), by = MNTH] 

this results in:

> dat       MNTH VAR count var  1: 201501   1     4   2  2: 201501   1     4   2  3: 201501   0     4   2  4: 201501   0     4   2  5: 201502   0     3   0  6: 201502   0     3   0  7: 201502   0     3   0  8: 201503   0     5   2  9: 201503   0     5   2 10: 201503   1     5   2 11: 201503   1     5   2 12: 201503   0     5   2 13: 201504   1     4   2 14: 201504   0     4   2 15: 201504   1     4   2 16: 201504   0     4   2 

For further reading about how to use data.table syntax, see the Getting started guides on the GitHub wiki.

like image 181
Jaap Avatar answered Sep 19 '22 15:09

Jaap