 

Select the top N values by group

Tags: r, aggregate

This is in response to a question asked on the r-help mailing list.

There are lots of examples of how to find top values by group using SQL, so I imagine it's easy to carry that knowledge over with the R sqldf package.
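As a rough sketch of that SQL route, a correlated subquery can count how many rows in the same group have a strictly higher mpg and keep the rows where that count is below N. The `cars` copy and its `model` column are assumptions introduced here only to carry the row names into the query; they are not part of the original question.

```r
library(sqldf)

# Hypothetical helper table: copy mtcars with the row names as a column.
cars <- data.frame(model = rownames(mtcars), mtcars, row.names = NULL)

# Keep a row when fewer than 3 rows in the same cyl group have a higher mpg.
# Rows tied at the cutoff are all kept.
top3 <- sqldf("
  select a.*
  from cars a
  where (select count(*)
         from cars b
         where b.cyl = a.cyl and b.mpg > a.mpg) < 3
  order by a.cyl, a.mpg desc
")
```

Flipping the comparison to `b.mpg < a.mpg` would select the bottom three instead.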

An example: when mtcars is grouped by cyl, here are the top three records for each distinct value of cyl. Note that ties are excluded in this case, but it'd be nice to show some different ways to treat ties.

                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb ranks
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1   2.0
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2   1.0
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1   2.0
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4   3.0
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4   1.0
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4   1.5
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4   1.5
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4   3.0

How to find the top or bottom (maximum or minimum) N records per group?
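For reference, the two tie treatments can be sketched in base R with `ave()` and `rank()`. This is only an illustration; the helper name `rank_in_group` is made up here, and other tie rules are possible.

```r
# Rank mpg within each cyl group, highest mpg first (rank 1 = highest).
# `rank_in_group` is a hypothetical helper, not from any package.
rank_in_group <- function(ties) {
  ave(-mtcars$mpg, mtcars$cyl,
      FUN = function(x) rank(x, ties.method = ties))
}

# "min": tied rows share the lowest rank, so ties at the cutoff are all kept.
top3_with_ties <- mtcars[rank_in_group("min") <= 3, ]

# "first": ties are broken by row order, so each group gets exactly 3 rows.
top3_no_ties <- mtcars[rank_in_group("first") <= 3, ]

# Sort by group and descending mpg for display.
top3_with_ties[order(top3_with_ties$cyl, -top3_with_ties$mpg), ]
```

Ranking `mtcars$mpg` instead of `-mtcars$mpg` would give the bottom N per group.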

Anthony Damico asked Feb 10 '13 16:02




2 Answers

This seems more straightforward using data.table as it performs the sort while setting the key.

So, to get the top 3 records in sorted (ascending) order:

require(data.table)
# Note: key="cyl" sorts by cyl only; use key=c("cyl", "mpg") to also sort by mpg within each group
d <- data.table(mtcars, key="cyl")
d[, head(.SD, 3), by=cyl]

does it.

And if you want the other end of the sort order (the last three rows of each group):

d[, tail(.SD, 3), by=cyl] # Thanks @MatthewDowle 
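A minimal sketch of the same head/tail idea with an explicit ordering on mpg, assuming the goal is to rank by mpg within each cyl. The `model` column name is an assumption chosen here to keep the car names.

```r
library(data.table)

# Keep the car names, which as.data.table() would otherwise drop.
d <- as.data.table(mtcars, keep.rownames = "model")

# Order by cyl ascending, then mpg descending within each cyl.
setorder(d, cyl, -mpg)

top3    <- d[, head(.SD, 3), by = cyl]  # three highest mpg per cyl
bottom3 <- d[, tail(.SD, 3), by = cyl]  # three lowest mpg per cyl
```

Both results take exactly three rows per group, so a tie at the cutoff is broken by row order.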

Edit: To handle ties using the mpg column:

d <- data.table(mtcars, key="cyl")
d.out <- d[, .SD[mpg %in% head(sort(unique(mpg)), 3)], by=cyl]

#     cyl  mpg  disp  hp drat    wt  qsec vs am gear carb rank
#  1:   4 22.8 108.0  93 3.85 2.320 18.61  1  1    4    1   11
#  2:   4 22.8 140.8  95 3.92 3.150 22.90  1  0    4    2    1
#  3:   4 21.5 120.1  97 3.70 2.465 20.01  1  0    3    1    8
#  4:   4 21.4 121.0 109 4.11 2.780 18.60  1  1    4    2    6
#  5:   6 18.1 225.0 105 2.76 3.460 20.22  1  0    3    1    7
#  6:   6 19.2 167.6 123 3.92 3.440 18.30  1  0    4    4    1
#  7:   6 17.8 167.6 123 3.92 3.440 18.90  1  0    4    4    2
#  8:   8 14.3 360.0 245 3.21 3.570 15.84  0  0    3    4    7
#  9:   8 10.4 472.0 205 2.93 5.250 17.98  0  0    3    4   14
# 10:   8 10.4 460.0 215 3.00 5.424 17.82  0  0    3    4    5
# 11:   8 13.3 350.0 245 3.73 3.840 15.41  0  0    3    4    3

# and for last N elements, of course it is straightforward
d.out <- d[, .SD[mpg %in% tail(sort(unique(mpg)), 3)], by=cyl]
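The same "three distinct values, keep every tied row" rule can also be sketched with a dense rank. This uses data.table's `frank()` as an alternative to the `%in%` filter above, not as a replacement for it; the `model` column name is an assumption.

```r
library(data.table)

d <- as.data.table(mtcars, keep.rownames = "model")

# Dense rank on descending mpg: tied values share a rank and no ranks are
# skipped, so "<= 3" keeps every row holding one of the 3 highest distinct
# mpg values in its cyl group.
top3_dense <- d[, .SD[frank(-mpg, ties.method = "dense") <= 3], by = cyl]
```

Using `frank(mpg, ties.method = "dense")` instead would pick the three lowest distinct values per group.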
Arun answered Sep 20 '22 17:09


dplyr does the trick

library(dplyr)

mtcars %>%
  arrange(desc(mpg)) %>%
  group_by(cyl) %>%
  slice(1:2)

    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1  33.9     4  71.1    65  4.22 1.835 19.90     1     1     4     1
2  32.4     4  78.7    66  4.08 2.200 19.47     1     1     4     1
3  21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1
4  21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
5  19.2     8 400.0   175  3.08 3.845 17.05     0     0     3     2
6  18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
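In dplyr 1.0.0 and later, `slice_max()` offers a way to make the tie handling explicit. The following is a sketch under that version assumption rather than part of the original answer.

```r
library(dplyr)

# Exactly 2 rows per cyl; a tie at the cutoff is broken by row order.
top2_strict <- mtcars %>%
  group_by(cyl) %>%
  slice_max(mpg, n = 2, with_ties = FALSE) %>%
  ungroup()

# Keep every row tied with the 2nd-highest mpg, so a group may return more than 2.
top2_ties <- mtcars %>%
  group_by(cyl) %>%
  slice_max(mpg, n = 2, with_ties = TRUE) %>%
  ungroup()
```

`slice_min()` takes the same arguments and returns the lowest values per group.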
Azam Yahya answered Sep 21 '22 17:09