dplyr - filter by group size

Tags: dataframe, r, filter, dplyr, subset

What is the best way to filter a data.frame so that only groups of, say, size 5 remain?

My data looks as follows:

require(dplyr)
n <- 1e5
x <- rnorm(n)
# Category size ranging each from 1 to 5
cat <- rep(seq_len(n/3), sample(1:5, n/3, replace = TRUE))[1:n]

dat <- data.frame(x = x, cat = cat)

The dplyr way I could come up with was:

dat <- group_by(dat, cat)

system.time({
  out1 <- dat %>% filter(n() == 5L)
})
#    user  system elapsed 
#   1.157   0.218   1.497

But this is very slow. Is there a better way in dplyr?

So far, my workaround looks as follows:

system.time({
  all_ind <- rep(seq_len(n_groups(dat)), group_size(dat))
  take_only <- which(group_size(dat) == 5L)
  out2 <- dat[all_ind %in% take_only, ]
})
#    user  system elapsed 
#   0.026   0.008   0.036
all.equal(out1, out2) # TRUE

But this doesn't feel very dplyr-like.

asked Mar 30 '17 06:03

Rentrop

2 Answers

You can do it more concisely with n():

library(dplyr)
dat %>% group_by(cat) %>% filter(n() == 5)
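
As a quick sanity check, the same pattern can be tried on a tiny data frame where the group sizes are known in advance. This is only a sketch: the names `toy`, `g` and `res` are made up for illustration and do not appear in the question.

```r
library(dplyr)

# Hypothetical toy data: group "a" has 5 rows, "b" has 2, "c" has 5
toy <- data.frame(
  g = rep(c("a", "b", "c"), times = c(5, 2, 5)),
  x = seq_len(12)
)

# Keep only the rows that belong to groups with exactly 5 rows
res <- toy %>%
  group_by(g) %>%
  filter(n() == 5) %>%
  ungroup()

as.character(unique(res$g))  # "a" "c"
nrow(res)                    # 10
```

The two rows of group "b" are dropped and the row order within the remaining groups is unchanged.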


answered Sep 29 '22 13:09

Joe

I know you asked for a dplyr solution but if you combine it with some purrr you can get it in one line without specifying any new functions. (A little slower though.)

library(dplyr)
library(purrr)
library(tidyr)

dat %>% 
  group_by(cat) %>% 
  nest() %>% 
  mutate(n = map_int(data, nrow)) %>%  # rows per group, as an integer column
  filter(n == 5) %>% 
  select(cat, n)
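
Note that this pipeline ends with only the group key and its size. If the original rows are wanted instead, one possible variation is to unnest the `data` column again after filtering. The sketch below assumes a reasonably recent tidyr and uses a made-up data frame (`toy`, `g`, `rows` are illustrative names, not from the question).

```r
library(dplyr)
library(purrr)
library(tidyr)

# Hypothetical toy data: group "a" has 5 rows, "b" has 2, "c" has 5
toy <- data.frame(
  g = rep(c("a", "b", "c"), times = c(5, 2, 5)),
  x = seq_len(12)
)

rows <- toy %>%
  group_by(g) %>%
  nest() %>%                           # one row per group, rows kept in `data`
  mutate(n = map_int(data, nrow)) %>%  # size of each nested group
  filter(n == 5) %>%                   # keep groups of size 5 only
  unnest(data) %>%                     # expand back to the original rows
  ungroup()

nrow(rows)  # 10
```

This keeps the helper column `n` alongside the data; it can be dropped with `select(-n)` if it is not needed.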

answered Sep 29 '22 13:09

ceefel
