I have a data.frame with several columns (17). Column 2 has several rows with the same value; I want to keep only one of those rows, specifically the one that has the maximum value in column 17.
For example:
A B
'a' 1
'a' 2
'a' 3
'b' 5
'b' 200
Would return
A B
'a' 3
'b' 200
(plus the rest of the columns)
So far I've been using the unique function, but I think it either keeps one of the rows at random or just keeps the first one that appears.
** UPDATE ** The real data has 376000 rows. I've tried the data.table and ddply suggestions, but they take forever. Any idea which approach is the most efficient?
A solution using package data.table:
set.seed(42)
dat <- data.frame(A=c('a','a','a','b','b'),B=c(1,2,3,5,200),C=rnorm(5))
library(data.table)
dat <- as.data.table(dat)
dat[, .SD[which.max(B)], by = A] # for each value of A, keep the row with the largest B
   A   B         C
1: a   3 0.3631284
2: b 200 0.4042683
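Regarding the speed concern in the update: a variant that is often suggested for large tables is to collect row numbers with .I and subset once, instead of building .SD for every group. The following is only a sketch on the same toy data, not a benchmark; the names idx and res are made up for illustration, and whether it is actually faster on the 376000-row data is an assumption that would need to be timed.

```r
library(data.table)

set.seed(42)
dat <- data.table(A = c('a','a','a','b','b'), B = c(1,2,3,5,200), C = rnorm(5))

# .I holds each row's position in the full table. For every group of A,
# take the position of the row with the largest B (V1 is the default
# name data.table gives to the unnamed result column).
idx <- dat[, .I[which.max(B)], by = A]$V1

# Subset the table once, using the collected row positions.
res <- dat[idx]
res
```

As with which.max in the original answer, ties within a group are resolved by keeping the first row that reaches the maximum.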
A not-so-elegant solution using base R functions:
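> ind <- with(dat, tapply(B, A, which.max)) # position of the max B within each group (using @Roland's data)
> mysplit <- split(dat, dat$A) # one data.frame per value of A, in the same order as ind
> do.call(rbind, lapply(seq_along(mysplit), function(i) mysplit[[i]][ind[i], ]))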
  A   B         C
3 a   3 0.3631284
5 b 200 0.4042683
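Another base R option avoids split and rbind altogether: sort so that the largest B comes first within each A, then drop every row whose A has already been seen. This is a sketch on the same toy data; the names sorted and res are illustrative, and it assumes B is numeric (so that it can be negated inside order).

```r
set.seed(42)
dat <- data.frame(A = c('a','a','a','b','b'), B = c(1,2,3,5,200), C = rnorm(5))

# Sort by A, and within each A by B in decreasing order,
# so the row with the maximum B is the first row of its group.
sorted <- dat[order(dat$A, -dat$B), ]

# duplicated() marks every occurrence of A after the first one;
# negating it keeps exactly the first (i.e. max-B) row per group.
res <- sorted[!duplicated(sorted$A), ]
res
```

Because this is a single sort followed by a vectorized filter, it tends to scale reasonably to a few hundred thousand rows, though it is worth timing on the real data.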