I am trying to find the most frequent value within a group for several factor variables while summarizing a data frame in dplyr. I need a formula that does the following:
There are several formulas that work. However, those that I could think of are all slow. Those that are fast are not convenient to apply to several variables in a data frame at once. I was wondering if somebody knows a fast method that integrates nicely with dplyr.
I tried the following:
generating sample data (50000 groups with 100 random letters)
z <- data.frame(a = rep(1:50000,100), b = sample(LETTERS, 5000000, replace = TRUE))
str(z)
'data.frame': 5000000 obs. of 2 variables:
$ a: int 1 2 3 4 5 6 7 8 9 10 ...
$ b: Factor w/ 26 levels "A","B","C","D",..: 6 4 14 12 3 19 17 19 15 20 ...
"Clean"-but-slow approach 1
y <- z %>%
group_by(a) %>%
summarise(c = names(table(b))[which.max(table(b))])
user system elapsed
26.772 2.011 29.568
"Clean"-but-slow approach 2
y <- z %>%
group_by(a) %>%
summarise(c = names(which(table(b) == max(table(b)))[1]))
user system elapsed
29.329 2.029 32.361
"Clean"-but-slow approach 3
y <- z %>%
group_by(a) %>%
summarise(c = names(sort(table(b),decreasing = TRUE)[1]))
user system elapsed
35.086 6.905 42.485
"Messy"-but-fast approach
y <- z %>%
group_by(a,b) %>%
summarise(counter = n()) %>%
group_by(a) %>%
filter(counter == max(counter))
y <- y[!duplicated(y$a),]
y <- y$counter <- NULL
user system elapsed
7.061 0.330 7.664
To find the most frequent factor value in an R data frame column, we can use names function with which. max function after creating the table for the particular column. This might be required while doing factorial analysis and we want to know which factor occurs the most.
filter() : pick observations by their values. select() : pick variables by their names.
In R, you can convert multiple numeric variables to factor using lapply function. The lapply function is a part of apply family of functions. They perform multiple iterations (loops) in R. In R, categorical variables need to be set as factor variables.
Here's another option with dplyr
:
set.seed(123)
z <- data.frame(a = rep(1:50000,100),
b = sample(LETTERS, 5000000, replace = TRUE),
stringsAsFactors = FALSE)
a <- z %>% group_by(a, b) %>% summarise(c=n()) %>% filter(row_number(desc(c))==1) %>% .$b
b <- z %>% group_by(a) %>% summarise(c=names(which(table(b) == max(table(b)))[1])) %>% .$c
We make sure these are equivalent approaches:
> identical(a, b)
#[1] TRUE
Update
As per mentioned by @docendodiscimus, you could also do:
count(z, a, b) %>% slice(which.max(n))
Here are the results on the benchmark:
library(microbenchmark)
mbm <- microbenchmark(
steven = z %>% group_by(a, b) %>% summarise(c = n()) %>% filter(row_number(desc(c))==1),
phil = z %>% group_by(a) %>% summarise(c = names(which(table(b) == max(table(b)))[1])),
docendo = count(z, a, b) %>% slice(which.max(n)),
times = 10
)
#Unit: seconds
# expr min lq mean median uq max neval cld
# steven 4.752168 4.789564 4.815986 4.813686 4.847964 4.875109 10 b
# phil 15.356051 15.378914 15.467534 15.458844 15.533385 15.606690 10 c
# docendo 4.586096 4.611401 4.669375 4.688420 4.702352 4.753583 10 a
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With