In package dplyr
, we have operations:
mtcars %>%
group_by(cyl) %>%
summarise(max_mpg = max(mpg)) # output one result for each unique group,
# result has nGroups number of rows.
giving
cyl max_mpg
<dbl> <dbl>
1 4 33.9
2 6 21.4
3 8 19.2
and
mtcars %>%
group_by(cyl) %>%
mutate(max_mpg = max(mpg)) # output the same result for every row in the
# same group, result has same number of rows
# as input
giving
Source: local data frame [32 x 12]
Groups: cyl [3]
mpg cyl disp hp drat wt qsec vs am gear carb max_mpg
(dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 21.4
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 21.4
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 33.9
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 21.4
5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 19.2
6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 21.4
7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 19.2
8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 33.9
9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 33.9
10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 21.4
...
What is the equivalent of these operations in data.table
?
I think mutate
is given by
data.table(mtcars) %>%
.[, max := max(mpg), by = cyl]
but I don't know how to get the equivalent of summarise
. I can add that for whatever reason if you do not have :=
it does a summarise
, e.g.:
data.table(mtcars) %>% .[, max(mpg), by = cyl]
gives
cyl V1
1: 6 21.4
2: 4 33.9
3: 8 19.2
but it is not obvious how to assign a name to the V1
column created.
Each dplyr verb must do some work to convert dplyr syntax to data. table syntax. This takes time proportional to the complexity of the input code, not the input data, so should be a negligible overhead for large datasets.
mutate() either changes an existing column or adds a new one. summarise() calculates a single value (per group).
summarise() creates a new data frame. It will have one (or more) rows for each combination of grouping variables; if there are no grouping variables, the output will have a single row summarising all observations in the input.
In my benchmarking project, Base R sorts a dataset much faster than dplyr or data.
library(data.table)
MT <- data.table(mtcars)
# summarise
MT[, .(max_mpg = max(mpg)), by = cyl]
cyl max_mpg
1: 6 21.4
2: 4 33.9
3: 8 19.2
# mutate
MT[, max_mpg := max(mpg), by = cyl]
max_mpg
is added to MT
, but the data is not displayed with this command
To display the data:
MT[, max_mpg := max(mpg), by = cyl][]
Since the data has 32 rows, displaying just the head:
MT[, max_mpg := max(mpg), by = cyl][,head(.SD, 6)]
mpg cyl disp hp drat wt qsec vs am gear carb max_mpg
1: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 21.4
2: 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 21.4
3: 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 33.9
4: 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 21.4
5: 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 19.2
6: 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 21.4
If you wish to sort by cyl
: (Code suggested by @thelatemail)
MT[, .(max_mpg = max(mpg)), keyby=cyl]
cyl max_mpg
1: 4 33.9
2: 6 21.4
3: 8 19.2
Edit
Adding this in response to @Alex's comment
data("mtcars")
setDT(mtcars)[, .(max_mpg = max(mpg)), by = cyl]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With