I've got a problem that has been bugging me for some time… hopefully somebody here can help me.
I have the following data frame:
f <- c('a','a','b','b','b','c','d','d','d','d')
v1 <- c(1.3,10,2,10,10,1.1,10,3.1,10,10)
v2 <- c(1:10)
df <- data.frame(f,v1,v2)
f is a factor; v1 and v2 are values. For each level of f, I only want to keep one row: the one with the lowest value of v1 within that factor level. The desired result is:
f v1 v2
a 1.3 1
b 2 3
c 1.1 6
d 3.1 8
I tried various things with aggregate, ddply, by, tapply… but nothing seems to work. I would be very thankful for any suggestions.
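For reference, one way the aggregate idea can be made to work is to compute the per-level minima and merge them back onto the data frame. This is only a minimal sketch (the variable name mins is made up here), and with ties it keeps all tied rows:
mins <- aggregate(v1 ~ f, data = df, FUN = min)   # one row per level of f with its minimum v1
merge(mins, df)                                   # join on the shared columns f and v1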
Using DWin's solution, tapply can be avoided by using ave.
df[ df$v1 == ave(df$v1, df$f, FUN=min), ]
This gives another speed-up, as shown below; mind you, the timing also depends on the number of levels. I mention it because ave is far too often forgotten, although it is one of the more powerful functions in R.
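To illustrate what ave is doing here (a sketch using the original 10-row data frame from the question): it returns a vector the same length as v1 in which each element is replaced by the minimum of its level, so comparing it to v1 with == flags exactly the rows to keep.
ave(df$v1, df$f, FUN = min)
## [1] 1.3 1.3 2.0 2.0 2.0 1.1 3.1 3.1 3.1 3.1
df$v1 == ave(df$v1, df$f, FUN = min)
## [1]  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE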
f <- rep(letters[1:20],10000)
v1 <- rnorm(20*10000)
v2 <- 1:(20*10000)
df <- data.frame(f,v1,v2)
> system.time(df[ df$v1 == ave(df$v1, df$f, FUN=min), ])
user system elapsed
0.05 0.00 0.05
> system.time(df[ df$v1 %in% tapply(df$v1, df$f, min), ])
user system elapsed
0.25 0.03 0.29
> system.time(lapply(split(df, df$f), FUN = function(x) {
+ vec <- which(x[3] == min(x[3]))
+ return(x[vec, ])
+ })
+ .... [TRUNCATED]
user system elapsed
0.56 0.00 0.58
> system.time(df[tapply(1:nrow(df),df$f,function(i) i[which.min(df$v1[i])]),]
+ )
user system elapsed
0.17 0.00 0.19
> system.time( ddply(df, .var = "f", .fun = function(x) {
+ return(subset(x, v1 %in% min(v1)))
+ }
+ )
+ )
user system elapsed
0.28 0.00 0.28
A data.table solution.
library(data.table)
DT <- as.data.table(df)
DT[,.SD[which.min(v1)], by = f]
## f v1 v2
## 1: a 1.3 1
## 2: b 2.0 3
## 3: c 1.1 6
## 4: d 3.1 8
Or, more efficiently:
DT[DT[,.I[which.min(v1)],by=f][['V1']]]
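To see why this works (sketched with the question's 10-row data): the inner call returns, for each group, the global row number of its minimum, and the outer DT[...] then subsets those rows.
DT[, .I[which.min(v1)], by = f]
##    f V1
## 1: a  1
## 2: b  3
## 3: c  6
## 4: d  8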
f <- rep(letters[1:20],100000)
v1 <- rnorm(20*100000)
v2 <- 1:(20*100000)
df <- data.frame(f,v1,v2)
DT <- as.data.table(df)
# f1/f2: order then drop duplicated levels; f3: ave; f4: data.table .SD; f5: data.table .I
f1 <- function() {df2 <- df[order(df$f, df$v1), ]; df2[!duplicated(df2$f), ]}
f2 <- function() {df2 <- df[order(df$v1), ]; df2[!duplicated(df2$f), ]}
f3 <- function() {df[df$v1 == ave(df$v1, df$f, FUN = min), ]}
f4 <- function() {DT[, .SD[which.min(v1)], by = f]}
f5 <- function() {DT[DT[, .I[which.min(v1)], by = f][['V1']]]}
library(microbenchmark)
microbenchmark(f1(), f2(), f3(), f4(), f5(), times = 5)
# Unit: milliseconds
# expr min lq median uq max neval
# f1() 3254.6620 3265.4760 3286.5440 3411.4054 3475.4198 5
# f2() 1630.8572 1639.3472 1651.5422 1721.4670 1738.6684 5
# f3() 172.2639 174.0448 177.4985 179.9604 184.7365 5
# f4() 206.1837 209.8161 209.8584 210.4896 210.7893 5
# f5() 105.5960 106.5006 107.9486 109.7216 111.1286 5
The .I approach is the winner (FR #2330 will hopefully render the elegance of the .SD approach similarly fast when implemented).
With plyr, I'd use:
ddply(df, .var = "f", .fun = function(x) {
  return(subset(x, v1 %in% min(v1)))
})
Give that a try and see if it returns what you want.
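Run against the question's data frame, it should return one row per level (and all tied rows when there is a tie, since %in% keeps every match of the minimum):
#   f  v1 v2
# 1 a 1.3  1
# 2 b 2.0  3
# 3 c 1.1  6
# 4 d 3.1  8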
Another tapply solution, with no unnecessary scanning of the vector with %in%:
df[tapply(1:nrow(df),df$f,function(i) i[which.min(df$v1[i])]),]
EDIT: This will keep only the first row in case of a tie.
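A quick sketch of that behaviour, using the question's data frame plus a hypothetical df_tie that duplicates the minimum for level a; which.min() returns the first index of the minimum, so the added row 11 is dropped:
df_tie <- rbind(df, data.frame(f = "a", v1 = 1.3, v2 = 11))
df_tie[tapply(1:nrow(df_tie), df_tie$f, function(i) i[which.min(df_tie$v1[i])]), ]
#   f  v1 v2
# 1 a 1.3  1
# 3 b 2.0  3
# 6 c 1.1  6
# 8 d 3.1  8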
EDIT2: Impressed by ave, I've made additional improvements:
df[sapply(split(1:nrow(df),df$f),function(x) x[which.min(df$v1[x])]),]
On my machine (using Joris' benchmark data):
> system.time(df[ df$v1 == ave(df$v1, df$f, FUN=min), ])
user system elapsed
0.022 0.000 0.021
> system.time(df[sapply(split(1:nrow(df),df$f),function(x) x[which.min(df$v1[x])]),])
user system elapsed
0.006 0.000 0.007
This is the dplyr way to filter for the minimum v1 values by groups of f:
require(dplyr)
df %>%
group_by(f) %>%
filter(v1 == min(v1))
#Source: local data frame [4 x 3]
#Groups: f
#
# f v1 v2
#1 a 1.3 1
#2 b 2.0 3
#3 c 1.1 6
#4 d 3.1 8
In case of ties in v1, this would result in multiple rows per group of f. If you want to avoid that, you can use:
df %>%
group_by(f) %>%
filter(rank(v1, ties.method= "first") == 1)
This way, you'll only get the first row in case of ties. You could alternatively use ties.method = "random" or others as described in the help file.
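For example, with a hypothetical tie added for level a (df_tie is not part of the original question), the two filters behave differently:
df_tie <- rbind(df, data.frame(f = "a", v1 = 1.3, v2 = 11))
df_tie %>% group_by(f) %>% filter(v1 == min(v1))                         # 5 rows: both tied rows for a
df_tie %>% group_by(f) %>% filter(rank(v1, ties.method = "first") == 1)  # 4 rows: first tied row only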
Here's a tapply solution:
> df[ df$v1 %in% tapply(df$v1, df$f, min), ]
f v1 v2
1 a 1.3 1
3 b 2.0 3
6 c 1.1 6
8 d 3.1 8
In your example it only picks out one per group, but if there were ties this method would show them all. (As would Parker's and Luštrik's I suspect.)
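To illustrate with a hypothetical tie added for level a, both tied rows come back. Note, though, that %in% compares against the minima of all groups, so a row whose v1 happened to equal another group's minimum would also be kept:
df_tie <- rbind(df, data.frame(f = "a", v1 = 1.3, v2 = 11))
df_tie[df_tie$v1 %in% tapply(df_tie$v1, df_tie$f, min), ]
#    f  v1 v2
# 1  a 1.3  1
# 3  b 2.0  3
# 6  c 1.1  6
# 8  d 3.1  8
# 11 a 1.3 11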