Let's say I have a dataframe with the following structure:
id A B
1 1 1
1 1 2
1 1 2
1 2 2
1 2 3
1 2 4
1 2 5
2 1 2
2 2 2
2 3 2
2 3 5
2 3 5
2 4 6
I'd like to get the most common combination of values in A and B for each id:
id A B
1 1 2
2 3 5
I need to do this for a fairly big dataset (several million rows). I've come up with a couple of horrible, slow, and very un-idiomatic solutions; I'm sure there is an easy, R-ish way.
I think I should be using aggregate, but I can't find a way to do it that works:
> aggregate(cbind(A, B) ~ id, d, Mode)
id A B
1 1 2 2
2 2 3 2
> # wrong!
> aggregate(interaction(A, B) ~ id, d, Mode)
id interaction(A, B)
1 1 1.2
2 2 3.5
> # close, but I need the original columns
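One base-R route that keeps the original columns (a sketch, assuming the example data frame is named `d` as in the question, and sidestepping the custom `Mode` function entirely) is to count each (id, A, B) combination with `aggregate` and then pick the most frequent row per id:

```r
# Rebuild the example data frame from the question
d <- data.frame(
  id = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  A  = c(1, 1, 1, 2, 2, 2, 2, 1, 2, 3, 3, 3, 4),
  B  = c(1, 2, 2, 2, 3, 4, 5, 2, 2, 2, 5, 5, 6)
)

# Count occurrences of each (id, A, B) combination
counts <- aggregate(list(n = rep(1L, nrow(d))), d[c("id", "A", "B")], sum)

# Within each id, keep the combination with the highest count
res <- do.call(rbind, lapply(split(counts, counts$id),
                             function(x) x[which.max(x$n), c("id", "A", "B")]))
res
```

Note that `which.max` returns the first maximum, so ties are broken silently by the sort order `aggregate` produces.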
Using dplyr:
library(dplyr)
df %>%
  group_by(id, A, B) %>%
  mutate(n = n()) %>%
  group_by(id) %>%
  slice(which.max(n)) %>%
  select(-n)
#Source: local data frame [2 x 3]
#Groups: id
#
# id A B
#1 1 1 2
#2 2 3 5
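On newer dplyr versions (1.0 or later), the same result can be written a little more directly with count() and slice_max(); a sketch, assuming df holds the example data:

```r
library(dplyr)

df <- data.frame(
  id = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  A  = c(1, 1, 1, 2, 2, 2, 2, 1, 2, 3, 3, 3, 4),
  B  = c(1, 2, 2, 2, 3, 4, 5, 2, 2, 2, 5, 5, 6)
)

df %>%
  count(id, A, B) %>%                         # one row per combination, count in n
  group_by(id) %>%
  slice_max(n, n = 1, with_ties = FALSE) %>%  # most frequent combination per id
  select(-n)
```

`with_ties = FALSE` guarantees exactly one row per id even when two combinations are equally frequent, mirroring the first-match behavior of `which.max`.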
And a similar data.table approach:
library(data.table)
setDT(df)[, .N, by=.(id, A, B)][, .SD[which.max(N)], by = id]
# id A B N
#1: 1 1 2 2
#2: 2 3 5 2
Edit to include a brief explanation:
Both approaches do essentially the same thing: first count the rows for each (id, A, B) combination, then keep the combination with the highest count within each id.
In the data.table version, you start with setDT(df) to convert the data.frame to a data.table by reference. Then [, .N, by = .(id, A, B)] produces the per-combination counts in a column N, and the chained [, .SD[which.max(N)], by = id] picks, within each id, the row of the subset .SD where N is largest.
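For very large data, a common data.table idiom avoids materializing .SD subsets by computing the winning row numbers with .I and indexing back into the counts table; a sketch on the same example data:

```r
library(data.table)

dt <- data.table(
  id = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  A  = c(1, 1, 1, 2, 2, 2, 2, 1, 2, 3, 3, 3, 4),
  B  = c(1, 2, 2, 2, 3, 4, 5, 2, 2, 2, 5, 5, 6)
)

# Count each (id, A, B) combination
counts <- dt[, .N, by = .(id, A, B)]

# .I yields row numbers within counts; take the row of the max N per id,
# then index those rows back and drop the count column
counts[counts[, .I[which.max(N)], by = id]$V1][, !"N"]
```

The inner call returns one row number (column V1) per id, so the outer subset is a single vectorized row lookup rather than a per-group .SD scan.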