
How to Filter out Rows per Group after Condition Occurs




I am new to R programming and am trying to remove certain rows within each group of rows once a filtering criterion has been met.

Scenario: For each GROUP, once two consecutive rows have TYPE "B", remove all the rows that follow them in that GROUP. The "Include in DataSet?" column shows what the output should be.

Here is my example input:

GROUP   TYPE    Include in DataSet?
1       A       yes
1       A       yes
1       B       yes
1       B       yes
1       B       no
2       A       yes
2       B       yes
2       B       yes
2       A       no
2       B       no
2       B       no

DF = structure(list(GROUP = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
2L, 2L), TYPE = c("A", "A", "B", "B", "B", "A", "B", "B", "A", 
"B", "B"), inc = c("yes", "yes", "yes", "yes", "no", "yes", "yes", 
"yes", "no", "no", "no")), .Names = c("GROUP", "TYPE", "inc"), row.names = c(NA, 
-11L), class = "data.frame")

Expected Output:

GROUP   TYPE    Include in DataSet?
1       A       yes
1       A       yes
1       B       yes
1       B       yes
2       A       yes
2       B       yes
2       B       yes

I have tried writing some code, but had no luck because of the grouping:

x <- allrows
for (i in x){
  for(j in x){
BobcatBlitz asked Oct 11 '18 18:10


3 Answers

You could do this by creating a new variable that identifies "double B" rows, then filter out rows after the first "double B" row in the group:

library(dplyr)

df %>%
    group_by(GROUP) %>%
    # Create new variable that tests whether this row and the one above it both have TYPE == 'B'
    mutate(double_B = (TYPE == 'B' & lag(TYPE) == 'B')) %>%
    # Find the first row with `double_B` in each group, filter out rows after it
    filter(row_number() <= min(which(double_B == TRUE))) %>%
    # Optionally, remove `double_B` column when done with it
    select(-double_B)

# A tibble: 7 x 3
# Groups:   GROUP [2]
  GROUP TYPE  IncludeinDataSet
  <int> <chr> <chr>           
1     1 A     yes             
2     1 A     yes             
3     1 B     yes             
4     1 B     yes             
5     2 A     yes             
6     2 B     yes             
7     2 B     yes       

As @Frank points out in the comment, you don't need to create the double_B variable: you can just test for the "double B" condition in the which statement inside the filter:

df %>%
    group_by(GROUP) %>%
    # Find the first row with `double_B` in each group, filter out rows after it
    filter(row_number() <= min(which(TYPE == 'B' & lag(TYPE) == 'B')))

Also, if a group contains no "double B", min() is called on an empty vector, so it returns Inf with a warning. The filter still works properly, because every row number is <= Inf and the whole group is kept.
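
To avoid the warning for groups without a "double B", one option is to count how many "double B" rows have occurred before each row and keep only the rows where that count is zero. The following is a minimal base R sketch of that idea, not the answerer's code. The helper name keep_until_double_b and the extra GROUP 3 (which has no "double B") are made up for illustration.

```r
# Hypothetical helper: TRUE for every row up to and including the first
# row that completes a "B, B" pair; TRUE everywhere if no pair exists.
keep_until_double_b <- function(type) {
  prev <- c(NA_character_, head(type, -1))        # previous value, NA for the first row
  hit  <- type == "B" & !is.na(prev) & prev == "B"
  (cumsum(hit) - hit) == 0                        # no hit strictly before this row
}

# Question data plus an assumed GROUP 3 without any "double B"
DF <- data.frame(
  GROUP = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L),
  TYPE  = c("A", "A", "B", "B", "B", "A", "B", "B", "A", "B", "B", "A", "B", "A"),
  stringsAsFactors = FALSE
)

# Apply the helper per group and put the results back in row order
keep   <- unsplit(lapply(split(DF$TYPE, DF$GROUP), keep_until_double_b), DF$GROUP)
result <- DF[keep, ]
print(result)
```

With dplyr loaded, the same helper should also work as filter(keep_until_double_b(TYPE)) after group_by(GROUP), since filter accepts a logical vector with one value per row of the group.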

divibisan answered Nov 04 '22 20:11


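This can be done by comparing the current value of 'TYPE' with the next value of 'TYPE' (lead) to find the numeric index of the first "B" that is followed by another "B". Adding 1 to that index gives the last row to keep, and seq_len turns it into the sequence from 1 to that number for subsetting the rows (inside slice).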

df1 %>% 
  group_by(GROUP) %>% 
  slice(seq_len(which((TYPE == "B") & lead(TYPE) == "B")[1] + 1))
# A tibble: 7 x 3
# Groups:   GROUP [2]
#  GROUP TYPE  IncludeInDataSet
#  <int> <chr> <chr>           
#1     1 A     yes             
#2     1 A     yes             
#3     1 B     yes             
#4     1 B     yes             
#5     2 A     yes             
#6     2 B     yes             
#7     2 B     yes          
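
To see how the index inside slice is computed, the calculation can be traced on a plain vector. This is a base R sketch of the same logic for a single group, with lead written out by hand. The function name n_rows_to_keep and the NA guard are additions for illustration: when no "B, B" pair exists, which(...)[1] is NA, and I would expect seq_len to reject that, so the guard falls back to keeping the whole group.

```r
# Hypothetical helper: number of leading rows to keep for one group's TYPE vector
n_rows_to_keep <- function(type) {
  nxt <- c(type[-1], NA)                            # next value, like dplyr::lead(type)
  first_pair <- which(type == "B" & nxt == "B")[1]  # index of first "B" followed by "B"
  if (is.na(first_pair)) length(type) else first_pair + 1L
}

type_g2 <- c("A", "B", "B", "A", "B", "B")          # GROUP 2 from the question
n_keep  <- n_rows_to_keep(type_g2)
kept    <- type_g2[seq_len(n_keep)]
print(kept)
```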


df1 <- structure(list(GROUP = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
 2L, 2L), TYPE = c("A", "A", "B", "B", "B", "A", "B", "B", "A", 
 "B", "B"), IncludeInDataSet = c("yes", "yes", "yes", "yes", "no", 
  "yes", "yes", "yes", "no", "no", "no")), class = "data.frame", 
 row.names = c(NA, -11L))
akrun answered Nov 04 '22 22:11


A different approach could be:


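library(dplyr)
library(data.table)  # for rleid()

df %>%
  # One sub-group per run of identical TYPE values within each GROUP
  group_by(GROUP, rleid(TYPE)) %>%
  # Position of each row within its run
  mutate(temp = seq_along(TYPE)) %>%
  ungroup() %>%
  group_by(GROUP) %>%
  # Keep rows up to the first "B" that is second in its run
  filter(row_number() <= min(which(TYPE == "B" & temp == 2))) %>%
  select(GROUP, TYPE, IncludeInDataSet)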
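
The run-based idea can also be illustrated without data.table. The sketch below uses base R's rle and sequence on a single group's TYPE vector to rebuild the run id and the position within each run (the temp column above). The variable names are chosen for illustration only.

```r
type <- c("A", "B", "B", "A", "B", "B")   # GROUP 2 from the question

runs       <- rle(type)                   # runs of identical consecutive values
run_id     <- rep(seq_along(runs$lengths), runs$lengths)  # one id per run
pos_in_run <- sequence(runs$lengths)      # position of each row within its run

# First row that is a "B" and the second element of its run
cutoff <- min(which(type == "B" & pos_in_run == 2))
print(type[seq_len(cutoff)])
```

Like the min(which(...)) filter it mirrors, this warns and returns Inf when the vector has no run of two or more "B" values.
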
tmfmnk answered Nov 04 '22 22:11
