Extract row corresponding to minimum value of a variable by group

I wish to (1) group data by one variable (State), (2) within each group find the row of minimum value of another variable (Employees), and (3) extract the entire row.

(1) and (2) are easy one-liners, and I feel like (3) should be too, but I can't get it.

Here is a sample data set:

> data
  State Company Employees
1    AK       A        82
2    AK       B       104
3    AK       C        37
4    AK       D        24
5    RI       E        19
6    RI       F       118
7    RI       G        88
8    RI       H        42

data <- structure(list(State = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 
        2L), .Label = c("AK", "RI"), class = "factor"), Company = structure(1:8, .Label = c("A", 
        "B", "C", "D", "E", "F", "G", "H"), class = "factor"), Employees = c(82L, 
        104L, 37L, 24L, 19L, 118L, 88L, 42L)), .Names = c("State", "Company", 
        "Employees"), class = "data.frame", row.names = c(NA, -8L))

Calculating the min by group is easy using aggregate:

> aggregate(Employees ~ State, data, function(x) min(x))
  State Employees
1    AK        24
2    RI        19

...or data.table:

> library(data.table)
> DT <- data.table(data)
> DT[ , list(Employees = min(Employees)), by = State]
   State Employees
1:    AK        24
2:    RI        19

But how do I extract the entire row corresponding to these min values, i.e. also including Company in the result?

asked Jun 05 '14 by Ed Swindelles


3 Answers

Slightly more elegant:

library(data.table)
DT[ , .SD[which.min(Employees)], by = State]

   State Company Employees
1:    AK       D        24
2:    RI       E        19

Slightly less elegant than using .SD, but a bit faster (for data with many groups):

DT[DT[ , .I[which.min(Employees)], by = State]$V1]

Also, just replace the expression which.min(Employees) with Employees == min(Employees), if your data set has multiple identical min values and you'd like to subset all of them.
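For example, a minimal sketch of that variant (same DT as above):

# keeps every row that ties for the group minimum, not just the first
DT[ , .SD[Employees == min(Employees)], by = State]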

See also Subset rows corresponding to max value by group using data.table.

answered by Señor O

Here is a dplyr solution (note that I am not a regular user):

library(dplyr)    
data %>% 
    group_by(State) %>% 
    slice(which.min(Employees))

As jazzurro notes in the comments, as of dplyr version 1.0.0, there is also now a built-in function slice_min:

data %>% 
   group_by(State) %>% 
   slice_min(order_by = Employees)
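
Note that slice_min() keeps all rows tied for the minimum by default; as a small sketch (assuming dplyr >= 1.0.0), passing with_ties = FALSE returns a single row per group:

data %>% 
   group_by(State) %>% 
   slice_min(order_by = Employees, with_ties = FALSE)
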
answered by agstudy

As this is Google's top hit, I thought I would add some additional options which I find useful to know. The idea is basically to arrange once by Employees and then take the unique rows per State.

Either using data.table

library(data.table)
unique(setDT(data)[order(Employees)], by = "State")
#    State Company Employees
# 1:    RI       E        19
# 2:    AK       D        24

Alternatively, we could first order and then subset .SD. Both of these operations were optimized in recent data.table versions: order seemingly triggers data.table:::forderv, while .SD[1L] triggers GForce.

setDT(data)[order(Employees), .SD[1L], by = State, verbose = TRUE] # <- Added verbose
# order optimisation is on, i changed from 'order(...)' to 'forder(DT, ...)'.
# i clause present and columns used in by detected, only these subset: State 
# Finding groups using forderv ... 0 sec
# Finding group sizes from the positions (can be avoided to save RAM) ... 0 sec
# Getting back original order ... 0 sec
# lapply optimization changed j from '.SD[1L]' to 'list(Company[1L], Employees[1L])'
# GForce optimized j to 'list(`g[`(Company, 1L), `g[`(Employees, 1L))'
# Making each group and running j (GForce TRUE) ... 0 secs
#    State Company Employees
# 1:    RI       E        19
# 2:    AK       D        24

Or dplyr

library(dplyr)
data %>% 
  arrange(Employees) %>% 
  distinct(State, .keep_all = TRUE)
#   State Company Employees
# 1    RI       E        19
# 2    AK       D        24

Another interesting idea, borrowed from @Khashaa's awesome answer (with a small modification in the form of mult = "first" in order to handle multiple matches), is to first find the minimum per group and then perform a binary join back to the original data. The advantage of this is both the use of data.table's gmin function (which skips the evaluation overhead) and the binary join feature.

tmp <- setDT(data)[, .(Employees = min(Employees)), by = State]
data[tmp, on = .(State, Employees), mult = "first"]
#    State Company Employees
# 1:    AK       D        24
# 2:    RI       E        19

Some benchmarks

library(data.table)
library(dplyr)
library(plyr)
library(stringi)
library(microbenchmark)

set.seed(123)
N <- 1e6
data <- data.frame(State = stri_rand_strings(N, 2, '[A-Z]'),
                   Employees = sample(N*10, N, replace = TRUE))
DT <- copy(data)
setDT(DT)
DT2 <- copy(DT)
str(DT)
str(DT2)

microbenchmark("(data.table) .SD[which.min]: " = DT[ , .SD[which.min(Employees)], by = State],
               "(data.table) .I[which.min]: " = DT[DT[ , .I[which.min(Employees)], by = State]$V1],
               "(data.table) order/unique: " = unique(DT[order(Employees)], by = "State"),
               "(data.table) order/.SD[1L]: " = DT[order(Employees), .SD[1L], by = State],
               "(data.table) self join (on):" = {
                 tmp <- DT[, .(Employees = min(Employees)), by = State]
                 DT[tmp, on = .(State, Employees), mult = "first"]},
               "(data.table) self join (setkey):" = {
                 tmp <- DT2[, .(Employees = min(Employees)), by = State] 
                 setkey(tmp, State, Employees)
                 setkey(DT2, State, Employees)
                 DT2[tmp, mult = "first"]},
               "(dplyr) slice(which.min): " = data %>% group_by(State) %>% slice(which.min(Employees)),
               "(dplyr) arrange/distinct: " = data %>% arrange(Employees) %>% distinct(State, .keep_all = TRUE),
               "(dplyr) arrange/group_by/slice: " = data %>% arrange(Employees) %>% group_by(State) %>% slice(1),
               "(plyr) ddply/which.min: " = ddply(data, .(State), function(x) x[which.min(x$Employees),]),
               "(base) by: " = do.call(rbind, by(data, data$State, function(x) x[which.min(x$Employees), ])))


# Unit: milliseconds
#                             expr        min         lq       mean     median         uq       max neval      cld
#    (data.table) .SD[which.min]:   119.66086  125.49202  145.57369  129.61172  152.02872  267.5713   100    d    
#     (data.table) .I[which.min]:    12.84948   13.66673   19.51432   13.97584   15.17900  109.5438   100 a       
#      (data.table) order/unique:    52.91915   54.63989   64.39212   59.15254   61.71133  177.1248   100  b      
#     (data.table) order/.SD[1L]:    51.41872   53.22794   58.17123   55.00228   59.00966  145.0341   100  b      
#     (data.table) self join (on):   44.37256   45.67364   50.32378   46.24578   50.69411  137.4724   100  b      
# (data.table) self join (setkey):   14.30543   15.28924   18.63739   15.58667   16.01017  106.0069   100 a       
#       (dplyr) slice(which.min):    82.60453   83.64146   94.06307   84.82078   90.09772  186.0848   100   c     
#       (dplyr) arrange/distinct:   344.81603  360.09167  385.52661  379.55676  395.29463  491.3893   100     e   
# (dplyr) arrange/group_by/slice:   367.95924  383.52719  414.99081  397.93646  425.92478  557.9553   100      f  
#         (plyr) ddply/which.min:   506.55354  530.22569  568.99493  552.65068  601.04582  727.9248   100       g 
#                      (base) by:  1220.38286 1291.70601 1340.56985 1344.86291 1382.38067 1512.5377   100        h
answered by David Arenburg