
How can I get the most common combination of several columns, aggregating by others, in a data.frame?

Tags: dataframe, r

Let's say I have a dataframe with the following structure:

id  A  B
 1  1  1
 1  1  2
 1  1  2
 1  2  2
 1  2  3
 1  2  4
 1  2  5
 2  1  2
 2  2  2
 2  3  2 
 2  3  5 
 2  3  5 
 2  4  6

I'd like to get the most common combination of values in A and B for each id:

id  A  B
 1  1  2
 2  3  5

I need to do this for a fairly big dataset (several million rows). I've come up with a couple of horrible, slow, and very un-idiomatic solutions, but I'm sure there is an easy, R-ish way.

I think I should be using aggregate, but I can't find a way to do it that works:

> aggregate(cbind(A, B) ~ id, d, Mode)
  id A B
1  1 2 2
2  2 3 2  
> # wrong!

> aggregate(interaction(A, B) ~ id, d, Mode)
  id interaction(A, B)
1  1               1.2
2  2               3.5
> # close, but I need the original columns
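
Note that `Mode` is not a base R function, so the calls above presumably rely on a user-defined helper. The sketch below assumes one common definition (the first most frequent value) and uses only the id 1 rows to show why the first attempt goes wrong: `aggregate` applies `Mode` to each column separately, so it pairs the most common A with the most common B instead of returning the most common pair.

```r
# Hypothetical helper (an assumption; the question does not show its definition):
# returns the first most frequent value of x.
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Rows of the example table where id == 1
A <- c(1, 1, 1, 2, 2, 2, 2)
B <- c(1, 2, 2, 2, 3, 4, 5)

Mode(A)            # 2: most common value of A on its own
Mode(B)            # 2: most common value of B on its own
Mode(paste(A, B))  # "1 2": most common (A, B) pair
```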
jesusiniesta asked Mar 18 '23 01:03


1 Answer

Using dplyr:

library(dplyr)
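df %>%
  group_by(id, A, B) %>%
  mutate(n = n()) %>%        # n = number of rows sharing this (id, A, B) combination
  group_by(id) %>%
  slice(which.max(n)) %>%    # keep the first row with the highest count per id
  select(-n)                 # drop the helper column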

#Source: local data frame [2 x 3]
#Groups: id
#
#  id A B
#1  1 1 2
#2  2 3 5

And a similar data.table approach (the result keeps the count column N):

library(data.table)
setDT(df)[, .N, by=.(id, A, B)][, .SD[which.max(N)], by = id]
#   id A B N
#1:  1 1 2 2
#2:  2 3 5 2

Edit to include a brief explanation:

Both approaches do essentially the same thing:

  • Group the data by id, A and B.
  • Compute the number of rows per group.
  • Group the data by id (only) and return the (first) row with the maximum count per id.

In the data.table version, you start with setDT(df) to convert the data.frame to a data.table object by reference (without copying it).
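
The same three steps can also be sketched in base R, which may help when neither package is available. This is a minimal illustration rather than part of the original answer: `d`, `counts` and `result` are illustrative names, the data reproduces the example table from the question, and ties are resolved by keeping whichever tied combination comes first after sorting.

```r
# Example data from the question
d <- data.frame(
  id = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  A  = c(1, 1, 1, 2, 2, 2, 2, 1, 2, 3, 3, 3, 4),
  B  = c(1, 2, 2, 2, 3, 4, 5, 2, 2, 2, 5, 5, 6)
)

# Steps 1 and 2: count the rows for each (id, A, B) combination
counts <- aggregate(n ~ id + A + B, data = transform(d, n = 1L), FUN = sum)

# Step 3: sort by id and descending count, then keep the first row per id
counts <- counts[order(counts$id, -counts$n), ]
result <- counts[!duplicated(counts$id), c("id", "A", "B")]
result
```

On several million rows the dplyr and data.table versions are likely to be considerably faster; the base R version is mainly useful for seeing what each step does.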

talat answered Mar 19 '23 16:03