Fastest way of determining most frequent factor in a grouped data frame in dplyr

I am trying to find the most frequent value within a group for several factor variables while summarizing a data frame in dplyr. I need a formula that does the following:

Find the most frequently used factor level among all factors for one variable in a group (so basically "max()" for counts of factor levels).
If there is a tie between several most-used-factor levels, pick any one of those factors-levels.
Return the factor-level name (not number of counts).

There are several formulas that work. However, those that I could think of are all slow. Those that are fast are not convenient to apply to several variables in a data frame at once. I was wondering if somebody knows a fast method that integrates nicely with dplyr.

I tried the following:

generating sample data (50000 groups with 100 random letters)

z <- data.frame(a = rep(1:50000,100), b = sample(LETTERS, 5000000, replace = TRUE))

str(z)
'data.frame':   5000000 obs. of  2 variables:
$ a: int  1 2 3 4 5 6 7 8 9 10 ...
$ b: Factor w/ 26 levels "A","B","C","D",..: 6 4 14 12 3 19 17 19 15 20 ...

"Clean"-but-slow approach 1

 y <- z %>% 
    group_by(a) %>% 
    summarise(c = names(table(b))[which.max(table(b))])

user    system  elapsed 
26.772  2.011   29.568

"Clean"-but-slow approach 2

y <- z %>% 
    group_by(a) %>% 
    summarise(c = names(which(table(b) == max(table(b)))[1]))

user    system  elapsed 
29.329  2.029   32.361

"Clean"-but-slow approach 3

y <- z %>% 
    group_by(a) %>% 
    summarise(c = names(sort(table(b),decreasing = TRUE)[1]))

user    system  elapsed 
35.086  6.905   42.485

"Messy"-but-fast approach

y <- z %>% 
     group_by(a,b) %>% 
     summarise(counter = n()) %>% 
     group_by(a) %>% 
     filter(counter == max(counter))
y <- y[!duplicated(y$a),]
y <- y$counter <- NULL

user   system  elapsed 
7.061  0.330   7.664

How do I find the most frequent value of a data set in R?

To find the most frequent factor value in an R data frame column, we can use names function with which. max function after creating the table for the particular column. This might be required while doing factorial analysis and we want to know which factor occurs the most.

What Dplyr function do you use to pick observations by their values?

filter() : pick observations by their values. select() : pick variables by their names.

How do you factor multiple variables in R?

In R, you can convert multiple numeric variables to factor using lapply function. The lapply function is a part of apply family of functions. They perform multiple iterations (loops) in R. In R, categorical variables need to be set as factor variables.

Here's another option with dplyr:

set.seed(123)
z <- data.frame(a = rep(1:50000,100), 
                b = sample(LETTERS, 5000000, replace = TRUE), 
                stringsAsFactors = FALSE)

a <- z %>% group_by(a, b) %>% summarise(c=n()) %>% filter(row_number(desc(c))==1) %>% .$b 
b <- z %>% group_by(a) %>% summarise(c=names(which(table(b) == max(table(b)))[1])) %>% .$c

We make sure these are equivalent approaches:

> identical(a, b)
#[1] TRUE

Update

As per mentioned by @docendodiscimus, you could also do:

count(z, a, b) %>% slice(which.max(n))

Here are the results on the benchmark:

library(microbenchmark)
mbm <- microbenchmark(
  steven = z %>% group_by(a, b) %>% summarise(c = n()) %>% filter(row_number(desc(c))==1),
  phil = z %>% group_by(a) %>% summarise(c = names(which(table(b) == max(table(b)))[1])),
  docendo = count(z, a, b) %>% slice(which.max(n)),
  times = 10
)

enter image description here

#Unit: seconds
#    expr       min        lq      mean    median        uq       max neval cld
#  steven  4.752168  4.789564  4.815986  4.813686  4.847964  4.875109    10  b 
#    phil 15.356051 15.378914 15.467534 15.458844 15.533385 15.606690    10   c
# docendo  4.586096  4.611401  4.669375  4.688420  4.702352  4.753583    10 a

Fastest way of determining most frequent factor in a grouped data frame in dplyr

Tags:

performance

r

dplyr

Phil

People also ask

1 Answers

Steven Beaupré

Recent Activity

Donate For Us

Fastest way of determining most frequent factor in a grouped data frame in dplyr

Tags:

performance

r

dplyr

Phil

People also ask

1 Answers

Steven Beaupré

Related questions

Recent Activity

Donate For Us