I've got a problem that has been bugging me for some time… hopefully somebody here can help me.
I have the following data frame:
f <- c('a','a','b','b','b','c','d','d','d','d')
v1 <- c(1.3,10,2,10,10,1.1,10,3.1,10,10)
v2 <- c(1:10)
df <- data.frame(f,v1,v2)
f is a factor; v1 and v2 are values. For each level of f, I only want to keep one row: the one with the lowest value of v1 within that factor level. The desired result is:
f v1 v2
a 1.3 1
b 2 3
c 1.1 6
d 3.1 8
I tried various things with aggregate, ddply, by, tapply… but nothing seems to work. I would be very thankful for any suggestions.
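For reference, one way the aggregate idea can be made to work is to compute the per-level minima and merge them back onto the data frame. This is only a minimal sketch (the variable name mins is made up here), and with ties it keeps all tied rows:
mins <- aggregate(v1 ~ f, data = df, FUN = min)   # one row per level of f with its minimum v1
merge(mins, df)                                   # join on the shared columns f and v1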
Using DWin's solution, tapply can be avoided by using ave.
df[ df$v1 == ave(df$v1, df$f, FUN=min), ]
This gives another speed-up, as shown below; mind you, the timing also depends on the number of levels. I mention it because ave is far too often forgotten, although it is one of the more powerful functions in R.
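To illustrate what ave is doing here (a sketch using the original 10-row data frame from the question): it returns a vector the same length as v1 in which each element is replaced by the minimum of its level, so comparing it to v1 with == flags exactly the rows to keep.
ave(df$v1, df$f, FUN = min)
## [1] 1.3 1.3 2.0 2.0 2.0 1.1 3.1 3.1 3.1 3.1
df$v1 == ave(df$v1, df$f, FUN = min)
## [1]  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE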
f <- rep(letters[1:20],10000)
v1 <- rnorm(20*10000)
v2 <- 1:(20*10000)
df <- data.frame(f,v1,v2)
> system.time(df[ df$v1 == ave(df$v1, df$f, FUN=min), ])
user system elapsed
0.05 0.00 0.05
> system.time(df[ df$v1 %in% tapply(df$v1, df$f, min), ])
user system elapsed
0.25 0.03 0.29
> system.time(lapply(split(df, df$f), FUN = function(x) {
+ vec <- which(x[3] == min(x[3]))
+ return(x[vec, ])
+ })
+ .... [TRUNCATED]
user system elapsed
0.56 0.00 0.58
> system.time(df[tapply(1:nrow(df),df$f,function(i) i[which.min(df$v1[i])]),]
+ )
user system elapsed
0.17 0.00 0.19
> system.time( ddply(df, .var = "f", .fun = function(x) {
+ return(subset(x, v1 %in% min(v1)))
+ }
+ )
+ )
user system elapsed
0.28 0.00 0.28
A data.table solution.
library(data.table)
DT <- as.data.table(df)
DT[,.SD[which.min(v1)], by = f]
## f v1 v2
## 1: a 1.3 1
## 2: b 2.0 3
## 3: c 1.1 6
## 4: d 3.1 8
Or, more efficiently:
DT[DT[,.I[which.min(v1)],by=f][['V1']]]
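To see why this works (sketched with the question's 10-row data): the inner call returns, for each group, the global row number of its minimum, and the outer DT[...] then subsets those rows.
DT[, .I[which.min(v1)], by = f]
##    f V1
## 1: a  1
## 2: b  3
## 3: c  6
## 4: d  8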
f <- rep(letters[1:20],100000)
v1 <- rnorm(20*100000)
v2 <- 1:(20*100000)
df <- data.frame(f,v1,v2)
DT <- as.data.table(df)
# f1/f2: order then drop duplicated levels; f3: ave; f4: data.table .SD; f5: data.table .I
f1 <- function() {df2 <- df[order(df$f, df$v1), ]; df2[!duplicated(df2$f), ]}
f2 <- function() {df2 <- df[order(df$v1), ]; df2[!duplicated(df2$f), ]}
f3 <- function() {df[df$v1 == ave(df$v1, df$f, FUN = min), ]}
f4 <- function() {DT[, .SD[which.min(v1)], by = f]}
f5 <- function() {DT[DT[, .I[which.min(v1)], by = f][['V1']]]}
library(microbenchmark)
microbenchmark(f1(), f2(), f3(), f4(), f5(), times = 5)
# Unit: milliseconds
# expr min lq median uq max neval
# f1() 3254.6620 3265.4760 3286.5440 3411.4054 3475.4198 5
# f2() 1630.8572 1639.3472 1651.5422 1721.4670 1738.6684 5
# f3() 172.2639 174.0448 177.4985 179.9604 184.7365 5
# f4() 206.1837 209.8161 209.8584 210.4896 210.7893 5
# f5() 105.5960 106.5006 107.9486 109.7216 111.1286 5
The .I approach is the winner (FR #2330 will hopefully render the elegance of the .SD approach similarly fast when implemented).
With plyr, I'd use:
ddply(df, .var = "f", .fun = function(x) {
  return(subset(x, v1 %in% min(v1)))
})
Give that a try and see if it returns what you want.
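Run against the question's data frame, it should return one row per level (and all tied rows when there is a tie, since %in% keeps every match of the minimum):
#   f  v1 v2
# 1 a 1.3  1
# 2 b 2.0  3
# 3 c 1.1  6
# 4 d 3.1  8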
Another tapply solution, with no unnecessary scanning of the vector with %in%:
df[tapply(1:nrow(df),df$f,function(i) i[which.min(df$v1[i])]),]
EDIT: This will keep only the first row in case of a tie.
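A quick sketch of that behaviour, using the question's data frame plus a hypothetical df_tie that duplicates the minimum for level a; which.min() returns the first index of the minimum, so the added row 11 is dropped:
df_tie <- rbind(df, data.frame(f = "a", v1 = 1.3, v2 = 11))
df_tie[tapply(1:nrow(df_tie), df_tie$f, function(i) i[which.min(df_tie$v1[i])]), ]
#   f  v1 v2
# 1 a 1.3  1
# 3 b 2.0  3
# 6 c 1.1  6
# 8 d 3.1  8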
EDIT2: Impressed by ave, I've made additional improvements:
df[sapply(split(1:nrow(df),df$f),function(x) x[which.min(df$v1[x])]),]
On my machine (using Joris' benchmark data):
> system.time(df[ df$v1 == ave(df$v1, df$f, FUN=min), ])
user system elapsed
0.022 0.000 0.021
> system.time(df[sapply(split(1:nrow(df),df$f),function(x) x[which.min(df$v1[x])]),])
user system elapsed
0.006 0.000 0.007
This is the dplyr way to filter for the minimum v1 values by groups of f:
require(dplyr)
df %>%
group_by(f) %>%
filter(v1 == min(v1))
#Source: local data frame [4 x 3]
#Groups: f
#
# f v1 v2
#1 a 1.3 1
#2 b 2.0 3
#3 c 1.1 6
#4 d 3.1 8
In case of ties in v1, this would result in multiple rows per group of f. If you want to avoid that, you can use:
df %>%
group_by(f) %>%
filter(rank(v1, ties.method= "first") == 1)
This way, you'll only get the first row in case of ties. You could alternatively use ties.method = "random" or others as described in the help file.
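For example, with a hypothetical tie added for level a (df_tie is not part of the original question), the two filters behave differently:
df_tie <- rbind(df, data.frame(f = "a", v1 = 1.3, v2 = 11))
df_tie %>% group_by(f) %>% filter(v1 == min(v1))                         # 5 rows: both tied rows for a
df_tie %>% group_by(f) %>% filter(rank(v1, ties.method = "first") == 1)  # 4 rows: first tied row only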
Here's a tapply solution:
> df[ df$v1 %in% tapply(df$v1, df$f, min), ]
f v1 v2
1 a 1.3 1
3 b 2.0 3
6 c 1.1 6
8 d 3.1 8
In your example it only picks out one per group, but if there were ties this method would show them all. (As would Parker's and Luštrik's I suspect.)
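To illustrate with a hypothetical tie added for level a, both tied rows come back. Note, though, that %in% compares against the minima of all groups, so a row whose v1 happened to equal another group's minimum would also be kept:
df_tie <- rbind(df, data.frame(f = "a", v1 = 1.3, v2 = 11))
df_tie[df_tie$v1 %in% tapply(df_tie$v1, df_tie$f, min), ]
#    f  v1 v2
# 1  a 1.3  1
# 3  b 2.0  3
# 6  c 1.1  6
# 8  d 3.1  8
# 11 a 1.3 11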