This is in response to a question asked on the r-help mailing list.
There are lots of examples of how to find top values by group using SQL, so I imagine it's easy to carry that knowledge over using the R sqldf package.
An example: when mtcars is grouped by cyl, here are the top three records for each distinct value of cyl. Note that ties are excluded in this case, but it'd be nice to show some different ways to treat ties.
                         mpg cyl  disp  hp drat    wt  qsec vs am gear carb ranks
    Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1   2.0
    Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2   1.0
    Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1   2.0
    Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4   3.0
    Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4   1.0
    Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4   1.5
    Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4   1.5
    Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4   3.0
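The numbers in the table above look consistent with ranking mpg in ascending order within each cyl group, using average ranks for ties, and keeping ranks of at most 3. That reading is an inference from the values, not something the question states. Under that assumption, a minimal base R sketch of three ways to treat ties might look like this (`top_by_rank` is a hypothetical helper name, not part of any package):

```r
# Rank mpg within each cyl group (ascending), then keep ranks up to n.
# `top_by_rank` is a hypothetical helper introduced only for illustration.
top_by_rank <- function(df, n = 3, ties = "average") {
  r <- ave(df$mpg, df$cyl, FUN = function(x) rank(x, ties.method = ties))
  out <- df[r <= n, ]
  out[order(out$cyl, out$mpg), ]
}

# "average": tied rows share the mean rank, so ties straddling the cutoff are dropped
avg_rows <- top_by_rank(mtcars, ties = "average")

# "min": tied rows share the lowest rank, so ties at the cutoff are all kept
min_rows <- top_by_rank(mtcars, ties = "min")

# "first": ties are broken by row order, giving exactly n rows per group
first_rows <- top_by_rank(mtcars, ties = "first")
```

The choice of `ties.method` is what decides whether a group can return fewer or more than n rows.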
How to find the top or bottom (maximum or minimum) N records per group?
To get the top values of a column in an R data frame, sort the values in decreasing order with sort(x, decreasing = TRUE) and then take the first few with head(). Combining head() and sort() in this way returns the top values in decreasing order.
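As a small illustration of the head() and sort() combination on the built-in mtcars data (the variable names here are chosen only for the example), note that sort() works on a vector, so whole rows need order() instead:

```r
# Top 3 mpg values as a plain numeric vector
top_mpg <- head(sort(mtcars$mpg, decreasing = TRUE), 3)

# Top 3 rows of the data frame, ordered by mpg from highest to lowest
top_rows <- head(mtcars[order(mtcars$mpg, decreasing = TRUE), ], 3)
```

Neither call is grouped; both look at the whole data frame at once.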
The maximum value of a column in R can be calculated with the max() function. max() takes a numeric vector, such as a data frame column (df$col), as its argument and returns the largest value in it. Note that R is case-sensitive, so the function is max(), not Max().
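For example, on mtcars, max() can be applied to a single column or, with tapply(), once per group. This is only a sketch of the single-maximum case; it does not return the top N rows:

```r
# Largest mpg value in the whole data frame
overall_max <- max(mtcars$mpg)

# Largest mpg value within each cyl group (a named vector keyed by cyl)
max_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, max)
```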
This seems more straightforward using data.table, as it sorts the table by the key columns when the key is set.
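So, to get the first 3 records per group in sorted (ascending) order:

    library(data.table)
    # Key on cyl and mpg so rows are sorted by mpg within each cyl group;
    # keying on cyl alone would sort by cyl only.
    d <- data.table(mtcars, key = c("cyl", "mpg"))
    d[, head(.SD, 3), by = cyl]

does it.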
And if you want the records from the other end of each group (the last three rows):

    d[, tail(.SD, 3), by = cyl]  # Thanks @MatthewDowle
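An alternative that does not rely on the key at all is to order explicitly in `i` before grouping. The sketch below assumes the column of interest is mpg; the column name "model" is chosen here only to keep the car names, which data.table would otherwise drop:

```r
library(data.table)

# keep.rownames stores the car names in a column (named "model" by choice here)
dt <- as.data.table(mtcars, keep.rownames = "model")

# Sort by mpg descending, then take the first 3 rows of each cyl group
top3 <- dt[order(-mpg), head(.SD, 3), by = cyl]

# Sort ascending instead to get the 3 lowest-mpg rows per group
bottom3 <- dt[order(mpg), head(.SD, 3), by = cyl]
```

Both calls return exactly three rows per group and break ties by row order.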
Edit: To sort out ties using the mpg column:

    d <- data.table(mtcars, key="cyl")
    d.out <- d[, .SD[mpg %in% head(sort(unique(mpg)), 3)], by=cyl]

    #     cyl  mpg  disp  hp drat    wt  qsec vs am gear carb rank
    #  1:   4 22.8 108.0  93 3.85 2.320 18.61  1  1    4    1   11
    #  2:   4 22.8 140.8  95 3.92 3.150 22.90  1  0    4    2    1
    #  3:   4 21.5 120.1  97 3.70 2.465 20.01  1  0    3    1    8
    #  4:   4 21.4 121.0 109 4.11 2.780 18.60  1  1    4    2    6
    #  5:   6 18.1 225.0 105 2.76 3.460 20.22  1  0    3    1    7
    #  6:   6 19.2 167.6 123 3.92 3.440 18.30  1  0    4    4    1
    #  7:   6 17.8 167.6 123 3.92 3.440 18.90  1  0    4    4    2
    #  8:   8 14.3 360.0 245 3.21 3.570 15.84  0  0    3    4    7
    #  9:   8 10.4 472.0 205 2.93 5.250 17.98  0  0    3    4   14
    # 10:   8 10.4 460.0 215 3.00 5.424 17.82  0  0    3    4    5
    # 11:   8 13.3 350.0 245 3.73 3.840 15.41  0  0    3    4    3

    # and for last N elements, of course it is straightforward
    d.out <- d[, .SD[mpg %in% tail(sort(unique(mpg)), 3)], by=cyl]
dplyr does the trick:

    library(dplyr)

    mtcars %>%
      arrange(desc(mpg)) %>%
      group_by(cyl) %>%
      slice(1:2)

        mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
    1  33.9     4  71.1    65  4.22 1.835 19.90     1     1     4     1
    2  32.4     4  78.7    66  4.08 2.200 19.47     1     1     4     1
    3  21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1
    4  21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
    5  19.2     8 400.0   175  3.08 3.845 17.05     0     0     3     2
    6  18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
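The arrange() and slice() approach breaks ties by row order. If ties matter, one option is slice_max(), which as far as I know is available from dplyr 1.0.0 onward and has a with_ties argument. A minimal sketch, assuming a recent enough dplyr is installed:

```r
library(dplyr)

# Two highest-mpg rows per cyl, dropping ties so each group has exactly 2 rows
top2_strict <- mtcars %>%
  group_by(cyl) %>%
  slice_max(mpg, n = 2, with_ties = FALSE) %>%
  ungroup()

# Same, but keep every row tied with the 2nd-highest value in its group
top2_ties <- mtcars %>%
  group_by(cyl) %>%
  slice_max(mpg, n = 2, with_ties = TRUE) %>%
  ungroup()
```

slice_min() works the same way for the lowest values per group.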