Use data.table to count and aggregate / summarize a column

Tags:

dataframe

r

aggregate

data.table

I want to count and aggregate (sum) a column in a data.table, and couldn't find the most efficient way to do this. The closest I found is the post "R summarizing multiple columns with data.table".

My data:

library(data.table)
set.seed(321)
dat <- data.table(MNTH = c(rep(201501,4), rep(201502,3), rep(201503,5), rep(201504,4)),
                  VAR = sample(c(0,1), 16, replace=TRUE))

> dat
      MNTH VAR
 1: 201501   1
 2: 201501   1
 3: 201501   0
 4: 201501   0
 5: 201502   0
 6: 201502   0
 7: 201502   0
 8: 201503   0
 9: 201503   0
10: 201503   1
11: 201503   1
12: 201503   0
13: 201504   1
14: 201504   0
15: 201504   1
16: 201504   0

I want to both count and sum VAR by MNTH using data.table. The desired result:

    MNTH COUNT VAR
1 201501     4   2
2 201502     3   0
3 201503     5   2
4 201504     4   2

asked Sep 28 '15 15:09

Whitebeard

1 Answer

The post you are referring to shows how to apply a single aggregation function to several columns. To apply different aggregation functions to different columns, list them inside .(), using .N for the number of rows in each group:

dat[, .(count = .N, var = sum(VAR)), by = MNTH]

this results in:

     MNTH count var
1: 201501     4   2
2: 201502     3   0
3: 201503     5   2
4: 201504     4   2
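As a self-contained check of the grouped summary above, here is a minimal sketch. It assumes the VAR values are typed in from the printed data in the question rather than drawn with sample(), since the same seed may not give the same draws in every R version. The names res and res_keyed are only illustrative.

```r
library(data.table)

# VAR values copied from the printed data in the question, so the
# example does not depend on the random number generator.
dat <- data.table(
  MNTH = c(rep(201501, 4), rep(201502, 3), rep(201503, 5), rep(201504, 4)),
  VAR  = c(1, 1, 0, 0,  0, 0, 0,  0, 0, 1, 1, 0,  1, 0, 1, 0)
)

# One row per MNTH: .N is the number of rows in the group,
# sum(VAR) is the group total.
res <- dat[, .(count = .N, var = sum(VAR)), by = MNTH]
print(res)

# keyby = groups the same way but also sorts (and keys) the result by
# MNTH. The data is already ordered by MNTH here, so the rows match.
res_keyed <- dat[, .(count = .N, var = sum(VAR)), keyby = MNTH]
print(res_keyed)
```

Using by = keeps groups in order of first appearance, while keyby = is worth considering when the input is not already sorted by the grouping column.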

You can also add these values to your existing dataset as new columns by updating it by reference:

dat[, `:=` (count = .N, var = sum(VAR)), by = MNTH]

this results in:

> dat
      MNTH VAR count var
 1: 201501   1     4   2
 2: 201501   1     4   2
 3: 201501   0     4   2
 4: 201501   0     4   2
 5: 201502   0     3   0
 6: 201502   0     3   0
 7: 201502   0     3   0
 8: 201503   0     5   2
 9: 201503   0     5   2
10: 201503   1     5   2
11: 201503   1     5   2
12: 201503   0     5   2
13: 201504   1     4   2
14: 201504   0     4   2
15: 201504   1     4   2
16: 201504   0     4   2
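Because := changes the table in place, it can help to see what "by reference" means in practice. The following is a small sketch under the same assumption as the printed data (VAR values typed in by hand). The name orig is hypothetical and only used to keep an untouched copy for comparison.

```r
library(data.table)

dat <- data.table(
  MNTH = c(rep(201501, 4), rep(201502, 3), rep(201503, 5), rep(201504, 4)),
  VAR  = c(1, 1, 0, 0,  0, 0, 0,  0, 0, 1, 1, 0,  1, 0, 1, 0)
)

# copy() makes a real copy. A plain `orig <- dat` would only give the
# same table a second name, so later := calls would show up in both.
orig <- copy(dat)

# No assignment back to `dat` is needed: := adds the columns in place,
# repeating each group's count and sum on every row of that group.
dat[, `:=`(count = .N, var = sum(VAR)), by = MNTH]

names(dat)   # now includes count and var
names(orig)  # still only MNTH and VAR

# := returns its result invisibly; appending [] prints the updated table.
dat[]
```

If the original table must stay unchanged, working on a copy() as above is one option; otherwise the in-place update avoids copying the data.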

For further reading about how to use data.table syntax, see the Getting started guides on the GitHub wiki.

answered Sep 19 '22 15:09

Jaap
