 

Remove duplicates, keeping most frequent row

Tags: r, dplyr

I would like to deduplicate my data, keeping the row whose val appears most frequently. If there is a tie, I don't care which value gets returned; the first in alphabetical or numeric order is fine. I would like to do this by group of id and var.

MRE:

df <- data.frame(
  id = rep("a", 8),
  var = c(rep("b", 4), rep("c", 4)),
  val = c("d", "d", "d", "e", "f", "f", "g", "g")
)

> df
  id var val
1  a   b   d
2  a   b   d
3  a   b   d
4  a   b   e
5  a   c   f
6  a   c   f
7  a   c   g
8  a   c   g

Should be:

  id var val
1  a   b   d
2  a   c   f

I'm working with large datasets and tidyverse pipe chains, so a dplyr solution would be preferable.

Asked Dec 18 '22 by Mark White

1 Answer

Use table and which.max to extract the mode:

df %>% 
    group_by(id, var) %>% 
    summarise(val = {t <- table(val); names(t)[which.max(t)] })

# A tibble: 2 x 3
# Groups:   id [?]
#  id    var   val  
#  <fct> <fct> <chr>
#1 a     b     d    
#2 a     c     f  
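If you would rather stay entirely within dplyr verbs, an equivalent sketch (assuming dplyr >= 1.0.0, where slice_max() is available) counts each combination first and then keeps the most frequent one per group:

library(dplyr)

df %>%
    count(id, var, val) %>%                      # frequency of each id/var/val combination
    group_by(id, var) %>%
    slice_max(n, n = 1, with_ties = FALSE) %>%   # keep the single most frequent val per group
    ungroup() %>%
    select(-n)

# returns the same two rows: a/b/d and a/c/f

Here with_ties = FALSE guarantees exactly one row per group, which matches the "any tie-breaker is fine" requirement in the question.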

Another way to do this in base R: create a three-way contingency table directly, then pick the most frequent val (the maximum along the third dimension) for each id/var cell:

apply(table(df), c(1, 2), function(v) names(v)[which.max(v)])

#   var
#id  b   c  
#  a "d" "f"

Convert this to a data frame:

as.data.frame.table(
    apply(table(df), c(1, 2), function(v) names(v)[which.max(v)])
)

#  id var Freq
#1  a   b    d
#2  a   c    f
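If you want the value column named val rather than the default Freq, as.data.frame.table() accepts responseName (and stringsAsFactors) arguments; a small sketch:

res <- as.data.frame.table(
    apply(table(df), c(1, 2), function(v) names(v)[which.max(v)]),
    responseName = "val",          # rename the default Freq column
    stringsAsFactors = FALSE
)
res

#  id var val
#1  a   b   d
#2  a   c   f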
Answered Jan 05 '23 by Psidom