Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is the equivalent of dplyr mutate and summarise in data table? [duplicate]

In package dplyr, we have operations:

mtcars %>%
    group_by(cyl) %>%
    summarise(max_mpg = max(mpg)) # output one result for each unique group,
                                  # result has nGroups number of rows.

giving

    cyl max_mpg
  <dbl>   <dbl>
1     4    33.9
2     6    21.4
3     8    19.2

and

mtcars %>%
    group_by(cyl) %>%
    mutate(max_mpg = max(mpg)) # output the same result for every row in the                       
                               # same group, result has same number of rows
                               # as input

giving

Source: local data frame [32 x 12]
Groups: cyl [3]

     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb max_mpg
   (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)   (dbl)
1   21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4    21.4
2   21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4    21.4
3   22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1    33.9
4   21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1    21.4
5   18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2    19.2
6   18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1    21.4
7   14.3     8 360.0   245  3.21 3.570 15.84     0     0     3     4    19.2
8   24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2    33.9
9   22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2    33.9
10  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4    21.4
...

What is the equivalent of these operations in data.table?

I think mutate is given by

data.table(mtcars) %>% 
    .[, max := max(mpg), by = cyl]

but I don't know how to get the equivalent of summarise. I can add that for whatever reason if you do not have := it does a summarise, e.g.:

data.table(mtcars) %>% .[, max(mpg), by = cyl]

gives

   cyl   V1
1:   6 21.4
2:   4 33.9
3:   8 19.2

but it is not obvious how to assign a name to the V1 column created.

like image 469
Alex Avatar asked Jul 07 '16 01:07

Alex


People also ask

Can you use dplyr on data table?

Each dplyr verb must do some work to convert dplyr syntax to data. table syntax. This takes time proportional to the complexity of the input code, not the input data, so should be a negligible overhead for large datasets.

What is the difference between Summarise and mutate in R?

mutate() either changes an existing column or adds a new one. summarise() calculates a single value (per group).

What does Summarise N N ()) do in R?

summarise() creates a new data frame. It will have one (or more) rows for each combination of grouping variables; if there are no grouping variables, the output will have a single row summarising all observations in the input.

Is Base R faster than dplyr?

In my benchmarking project, Base R sorts a dataset much faster than dplyr or data.


1 Answers

library(data.table)
MT <- data.table(mtcars)

# summarise
MT[, .(max_mpg = max(mpg)), by = cyl]

   cyl max_mpg
1:   6    21.4
2:   4    33.9
3:   8    19.2

# mutate
MT[, max_mpg := max(mpg), by = cyl]

max_mpg is added to MT, but the data is not displayed with this command

To display the data:

MT[, max_mpg := max(mpg), by = cyl][]

Since the data has 32 rows, displaying just the head:

MT[, max_mpg := max(mpg), by = cyl][,head(.SD, 6)]

     mpg cyl disp  hp drat    wt  qsec vs am gear carb max_mpg
 1: 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4    21.4
 2: 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4    21.4
 3: 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1    33.9
 4: 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1    21.4
 5: 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2    19.2
 6: 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1    21.4

If you wish to sort by cyl: (Code suggested by @thelatemail)

MT[, .(max_mpg = max(mpg)), keyby=cyl]

   cyl max_mpg
1:   4    33.9
2:   6    21.4
3:   8    19.2

Edit

Adding this in response to @Alex's comment

data("mtcars")
setDT(mtcars)[, .(max_mpg = max(mpg)), by = cyl]
like image 195
Sumedh Avatar answered Sep 29 '22 16:09

Sumedh