
Select rows with min value by group

Tags:

dataframe

r

I have a problem that has been bugging me for some time; hopefully somebody here can help.

I have the following data frame:

f <- c('a','a','b','b','b','c','d','d','d','d')
v1 <- c(1.3,10,2,10,10,1.1,10,3.1,10,10)
v2 <- c(1:10)
df <- data.frame(f,v1,v2)

f is a factor; v1 and v2 are values. For each level of f, I want to keep only one row: the one with the lowest value of v1 within that factor level.

f   v1  v2
a   1.3 1
b   2   3
c   1.1 6
d   3.1 8

I tried various things with aggregate, ddply, by, tapply… but nothing seems to work. I would be very thankful for any suggestions.

donodarazao asked Nov 15 '10 23:11



6 Answers

Building on DWin's solution, tapply can be avoided by using ave:

df[ df$v1 == ave(df$v1, df$f, FUN=min), ]

This gives another speed-up, as shown below. Mind you, the gain also depends on the number of levels. I mention it because ave is far too often forgotten, although it is one of the more powerful functions in R.

f <- rep(letters[1:20],10000)
v1 <- rnorm(20*10000)
v2 <- 1:(20*10000)
df <- data.frame(f,v1,v2)

> system.time(df[ df$v1 == ave(df$v1, df$f, FUN=min), ])
   user  system elapsed 
   0.05    0.00    0.05 

> system.time(df[ df$v1 %in% tapply(df$v1, df$f, min), ])
   user  system elapsed 
   0.25    0.03    0.29 

> system.time(lapply(split(df, df$f), FUN = function(x) {
+             vec <- which(x[3] == min(x[3]))
+             return(x[vec, ])
+         })
+  .... [TRUNCATED] 
   user  system elapsed 
   0.56    0.00    0.58 

> system.time(df[tapply(1:nrow(df),df$f,function(i) i[which.min(df$v1[i])]),]
+ )
   user  system elapsed 
   0.17    0.00    0.19 

> system.time( ddply(df, .var = "f", .fun = function(x) {
+     return(subset(x, v1 %in% min(v1)))
+     }
+ )
+ )
   user  system elapsed 
   0.28    0.00    0.28 
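
To make the ave idea above concrete, here is a minimal sketch on the small data frame from the question. The variable names group_min and result are only illustrative, not part of the original answer. The point is that ave returns one value per row (the minimum of that row's group), so it can be compared to v1 element by element.

```r
# Small example data from the question
df <- data.frame(
  f  = c('a','a','b','b','b','c','d','d','d','d'),
  v1 = c(1.3,10,2,10,10,1.1,10,3.1,10,10),
  v2 = 1:10
)

# ave() returns a vector as long as df$v1: for every row,
# the minimum of v1 within that row's group of f
group_min <- ave(df$v1, df$f, FUN = min)

# Keep the rows whose v1 equals the minimum of their own group
result <- df[df$v1 == group_min, ]
print(result)
```

Because the comparison is done per row against the row's own group minimum, tied minima within a group would all be kept.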
Joris Meys answered Oct 27 '22 14:10


A data.table solution.

library(data.table)
DT <- as.data.table(df)
DT[,.SD[which.min(v1)], by = f]

##   f  v1 v2
## 1: a 1.3  1
## 2: b 2.0  3
## 3: c 1.1  6
## 4: d 3.1  8

Or, more efficiently:

DT[DT[,.I[which.min(v1)],by=f][['V1']]]

Some benchmarking:

f <- rep(letters[1:20],100000)
v1 <- rnorm(20*100000)
v2 <- 1:(20*100000)
df <- data.frame(f,v1,v2)
DT <- as.data.table(df)
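# order by group and value, then keep the first row of each group
f1 <- function() {
  df2 <- df[order(df$f, df$v1), ]
  df2[!duplicated(df2$f), ]
}

# order by value only, then keep the first row of each group
f2 <- function() {
  df2 <- df[order(df$v1), ]
  df2[!duplicated(df2$f), ]
}

f3 <- function() df[df$v1 == ave(df$v1, df$f, FUN = min), ]

f4 <- function() DT[, .SD[which.min(v1)], by = f]

f5 <- function() DT[DT[, .I[which.min(v1)], by = f][['V1']]]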

library(microbenchmark)
microbenchmark(f1(),f2(),f3(),f4(), f5(),times = 5)
# Unit: milliseconds
# expr       min        lq    median        uq       max neval
# f1() 3254.6620 3265.4760 3286.5440 3411.4054 3475.4198     5
# f2() 1630.8572 1639.3472 1651.5422 1721.4670 1738.6684     5
# f3()  172.2639  174.0448  177.4985  179.9604  184.7365     5
# f4()  206.1837  209.8161  209.8584  210.4896  210.7893     5
# f5()  105.5960  106.5006  107.9486  109.7216  111.1286     5

The .I approach is the winner (FR #2330 will hopefully render the elegance of the .SD approach similarly fast when implemented).
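
The benchmark's f1 and f2 use a base R idiom that is not spelled out above: sort by v1, then keep the first row seen for each group. A minimal sketch on the question's data, assuming the small data frame from the question (sorted and result are illustrative names only):

```r
# Small example data from the question
df <- data.frame(
  f  = c('a','a','b','b','b','c','d','d','d','d'),
  v1 = c(1.3,10,2,10,10,1.1,10,3.1,10,10),
  v2 = 1:10
)

# Sort all rows by v1 so that each group's smallest value comes first
sorted <- df[order(df$v1), ]

# duplicated() flags every occurrence of f after the first one,
# so negating it keeps exactly one row per group: the one with the lowest v1
result <- sorted[!duplicated(sorted$f), ]

# The rows come out ordered by v1; reorder by group if needed
result_by_group <- result[order(result$f), ]
print(result_by_group)
```

This always returns a single row per group, even when the minimum is tied.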

mnel answered Oct 27 '22 13:10


With plyr, I'd use:

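library(plyr)
# .variables is the full argument name (.var only works through partial matching)
ddply(df, .variables = "f", .fun = function(x) {
    # keep the row(s) of this group whose v1 equals the group minimum
    subset(x, v1 %in% min(v1))
})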

Give that a try and see if it returns what you want.

Matt Parker answered Oct 27 '22 13:10


Another tapply solution, which avoids the unnecessary vector scan done by %in%:

df[tapply(1:nrow(df),df$f,function(i) i[which.min(df$v1[i])]),]

EDIT: This keeps only the first row in case of a tie.

EDIT2: Impressed by ave, I've made additional improvements:

df[sapply(split(1:nrow(df),df$f),function(x) x[which.min(df$v1[x])]),]

On my machine (using Joris' benchmark data):

> system.time(df[ df$v1 == ave(df$v1, df$f, FUN=min), ])
   user  system elapsed
  0.022   0.000   0.021
> system.time(df[sapply(split(1:nrow(df),df$f),function(x) x[which.min(df$v1[x])]),])
   user  system elapsed
  0.006   0.000   0.007
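
A small sketch of the tie behaviour mentioned in the first EDIT, using a made-up data frame (tied, idx, first_only and all_ties are hypothetical names, not from the answers): which.min returns only the first position of the minimum, whereas comparing against the group minimum keeps every tied row.

```r
# Hypothetical data with tied minima in both groups
tied <- data.frame(
  f  = c('a','a','a','b','b'),
  v1 = c(5, 1, 1, 2, 2),
  v2 = 1:5
)

# split() gives the row indices of each group; which.min() picks the
# position of the first minimum inside each group
idx <- sapply(split(seq_len(nrow(tied)), tied$f),
              function(i) i[which.min(tied$v1[i])])
first_only <- tied[idx, ]

# Comparing with the per-row group minimum keeps all tied rows
all_ties <- tied[tied$v1 == ave(tied$v1, tied$f, FUN = min), ]

print(first_only)
print(all_ties)
```

Which of the two is wanted depends on whether exactly one row per group is required.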
mbq answered Oct 27 '22 12:10


This is the dplyr way to filter for the minimum v1 values by groups of f:

require(dplyr)
df %>%
  group_by(f) %>%
  filter(v1 == min(v1))

#Source: local data frame [4 x 3]
#Groups: f
#
#  f  v1 v2
#1 a 1.3  1
#2 b 2.0  3
#3 c 1.1  6
#4 d 3.1  8

In cases of ties in v1, this would result in multiple rows per group of f. If you want to avoid that, you can use:

df %>% 
  group_by(f) %>% 
  filter(rank(v1, ties.method= "first") == 1)

This way, you'll only get the first row in case of ties. You could alternatively use ties.method = "random" or others as described in the help file.
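
To see why ties.method = "first" yields a single row, it may help to look at rank on its own in base R. This is only an illustration with a made-up vector x; the names r_first and r_min are not from the answer.

```r
# Hypothetical vector with a tied minimum (positions 2 and 3)
x <- c(2, 1, 1, 3)

# "first" breaks ties by position, so only one element gets rank 1
r_first <- rank(x, ties.method = "first")

# "min" gives every tied element the lowest rank of the tie
r_min <- rank(x, ties.method = "min")

print(r_first == 1)  # TRUE only for the first of the tied minima
print(r_min == 1)    # TRUE for both tied minima
```

So filtering on rank(...) == 1 with "first" keeps one row per group, while "min" would behave like v1 == min(v1).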

talat answered Oct 27 '22 12:10


Here's a tapply solution:

> df[ df$v1 %in% tapply(df$v1, df$f, min), ]

  f  v1 v2
1 a 1.3  1
3 b 2.0  3
6 c 1.1  6
8 d 3.1  8

In your example it only picks out one per group, but if there were ties this method would show them all. (As would Parker's and Luštrik's I suspect.)
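
One caveat worth checking before relying on the %in% version: it compares each v1 against the minima of all groups, not just the row's own group. The sketch below uses a made-up data frame (df2, mins, with_in and with_ave are hypothetical names) in which a non-minimal row of group a happens to equal the minimum of group b.

```r
# Hypothetical data: row 2 (group a, v1 = 2) equals the minimum of group b
df2 <- data.frame(
  f  = c('a','a','b','b'),
  v1 = c(1, 2, 2, 3),
  v2 = 1:4
)

# Per-group minima: a = 1, b = 2
mins <- tapply(df2$v1, df2$f, min)

# %in% matches against every group's minimum, so row 2 slips through
with_in <- df2[df2$v1 %in% mins, ]

# Comparing against the row's own group minimum does not have this problem
with_ave <- df2[df2$v1 == ave(df2$v1, df2$f, FUN = min), ]

print(with_in)
print(with_ave)
```

On the question's data the two give the same rows, so this only matters when values can coincide across groups.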

IRTFM answered Oct 27 '22 14:10