Subset data frame based on number of rows per group

Tags: dataframe, r, r-faq, subset

I have data like this, where some values of name occur more than three times:

    df <- data.frame(name = c("a", "a", "a", "b", "b", "c", "c", "c", "c"), x = 1:9)

      name x
    1    a 1
    2    a 2
    3    a 3
    4    b 4
    5    b 5
    6    c 6
    7    c 7
    8    c 8
    9    c 9

I wish to subset (filter) the data based on the number of rows (observations) within each level of the name variable. If a level of name occurs more than, say, 3 times, I want to remove all rows belonging to that level. In this example, we would drop the observations where name == "c", since that group has more than 3 rows:

      name x
    1    a 1
    2    a 2
    3    a 3
    4    b 4
    5    b 5

I wrote this code, but can't get it to work.

    as.data.frame(table(unique(df)$name))
    subset(df, name > 3)


asked Nov 25 '13 21:11

SJSU2013

2 Answers

First, two base alternatives. One relies on table, and the other on ave and length. Then, two data.table ways.

1. `table`

    # keep groups with at most 3 rows, i.e. drop groups with more than 3
    tt <- table(df$name)

    df2 <- subset(df, name %in% names(tt[tt <= 3]))
    # or
    df2 <- df[df$name %in% names(tt[tt <= 3]), ]

If you want to walk it through step by step:

    # count each 'name', assign the result to an object 'tt'
    tt <- table(df$name)

    # which 'name' in 'tt' occur at most three times?
    # The result is a logical vector that can be used to subset the table 'tt'
    tt <= 3

    # from the table, select the 'name' that occur at most three times
    tt[tt <= 3]

    # ...their names
    names(tt[tt <= 3])

    # rows of the data frame whose 'name' is among those names
    # the result is a logical vector that can be used to subset the data frame 'df'
    df$name %in% names(tt[tt <= 3])

    # subset the data frame by that logical vector
    # 'TRUE' rows are kept, 'FALSE' rows are removed.
    # assign the result to a data frame with a new name
    df2 <- subset(df, name %in% names(tt[tt <= 3]))
    # or
    df2 <- df[df$name %in% names(tt[tt <= 3]), ]
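If you need this more than once, the steps above can be wrapped in a small helper. This is only a sketch: the function name `keep_small_groups` and its arguments `data`, `group` and `max_n` are made up for illustration and are not part of base R. It keeps groups with at most `max_n` rows, which matches the rule in the question (groups with more than 3 rows are dropped).

```r
# Hypothetical helper (name and arguments are illustrative only):
# keep the rows whose group has at most `max_n` rows.
keep_small_groups <- function(data, group, max_n) {
  counts <- table(data[[group]])             # number of rows per group level
  keep   <- names(counts[counts <= max_n])   # levels small enough to keep
  data[data[[group]] %in% keep, , drop = FALSE]
}

df <- data.frame(name = c("a", "a", "a", "b", "b", "c", "c", "c", "c"), x = 1:9)
res <- keep_small_groups(df, "name", 3)
res
```

With the example data, `res` holds the five rows of groups a and b.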

2. `ave` and `length`

As suggested by @flodel:

    df[ave(df$x, df$name, FUN = length) <= 3, ]
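To see why this works, it may help to look at what `ave` returns before it is used as a filter. The sketch below uses `seq_along(df$name)` as the first argument instead of `df$x`; that is my substitution, made so the counts do not depend on the type of `x`. The variable name `n_per_row` is also just for illustration.

```r
df <- data.frame(name = c("a", "a", "a", "b", "b", "c", "c", "c", "c"), x = 1:9)

# ave() applies FUN within each group and returns a vector as long as the input,
# so every row carries the size of its own group.
n_per_row <- ave(seq_along(df$name), df$name, FUN = length)
n_per_row
# 3 3 3 2 2 4 4 4 4

# keep the rows whose group has at most 3 rows
df[n_per_row <= 3, ]
```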

3. `data.table`: `.N` and `.SD`:

    library(data.table)
    # return the group's rows (.SD) only if the group has at most 3 rows
    setDT(df)[, if (.N <= 3) .SD, by = name]

4. `data.table`: `.N` and `.I`:

    setDT(df)
    # collect the row numbers (.I) of groups with at most 3 rows, then subset by them
    df[df[, .I[.N <= 3], by = name]$V1]
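The `.I` version does two things in one line, so here is a sketch that splits it into an inner and an outer step. It assumes the data.table package is installed; the names `dt` and `idx` are only for illustration, and a fresh data.table is built so the original `df` is not modified by reference.

```r
library(data.table)

dt <- data.table(name = c("a", "a", "a", "b", "b", "c", "c", "c", "c"), x = 1:9)

# Inner step: within each group, .I holds that group's row numbers in the full table.
# Indexing .I with a single TRUE keeps all of them; a single FALSE keeps none.
idx <- dt[, .I[.N <= 3], by = name]
idx$V1
# 1 2 3 4 5

# Outer step: use the collected row numbers to subset the original table.
dt[idx$V1]
```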

See also the related Q&A Count number of observations/rows per group and add result to data frame.

answered Oct 11 '22 17:10

Henrik

Using the dplyr package:

    library(dplyr)

    df %>%
      group_by(name) %>%
      filter(n() < 4)

    # A tibble: 5 x 2
    # Groups:   name [2]
      name      x
      <fct> <int>
    1 a         1
    2 a         2
    3 a         3
    4 b         4
    5 b         5

n() returns the number of observations in the current group, so we can group_by name, and then keep only those rows which are part of a group where the number of rows in that group is less than 4.
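Note that the result above is still grouped by name (see the `# Groups:` line in the output). If later steps should not run per group, one option is to add `ungroup()` at the end. A minimal sketch, assuming dplyr is installed; the name `res` is only for illustration:

```r
library(dplyr)

df <- data.frame(name = c("a", "a", "a", "b", "b", "c", "c", "c", "c"), x = 1:9)

res <- df %>%
  group_by(name) %>%
  filter(n() <= 3) %>%   # same condition as n() < 4
  ungroup()              # drop the grouping so later steps see a plain tibble

res
```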

answered Oct 11 '22 18:10

Joe

