I would like to deduplicate my data, keeping the row that has the most frequent appearances. If there is a tie in rows, I don't care which gets returned—the first in alphabetical or numeric order is fine. I would like to do this by group of id
and var
.
MRE:
df <- data.frame(
id = rep("a", 8),
var = c(rep("b", 4), rep("c", 4)),
val = c("d", "d", "d", "e", "f", "f", "g", "g")
)
> df
id var val
1 a b d
2 a b d
3 a b d
4 a b e
5 a c f
6 a c f
7 a c g
8 a c g
Should be:
id var val
1 a b d
2 a c f
I'm working with large datasets and tidyverse
pipe chains, so a dplyr
solution would be preferable.
Use table
and which.max
to extract the mode
:
df %>%
group_by(id, var) %>%
summarise(val = {t <- table(val); names(t)[which.max(t)] })
# A tibble: 2 x 3
# Groups: id [?]
# id var val
# <fct> <fct> <chr>
#1 a b d
#2 a c f
Another way to do this in base R: Create a three way contingency table directly, and then find the max column along the third axis:
apply(table(df), c(1, 2), function(v) names(v)[which.max(v)])
# var
#id b c
# a "d" "f"
Convert this to a data frame:
as.data.frame.table(
apply(table(df), c(1, 2), function(v) names(v)[which.max(v)])
)
# id var Freq
#1 a b d
#2 a c f
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With