I am new to R programming and am trying to remove certain rows within each group once a filtering criterion has been met.
Scenario: For each GROUP, if two consecutive rows have TYPE "B", remove all subsequent rows in that GROUP. The "Include in DataSet?" column shows which rows the output should keep.
Here is my example input:
GROUP TYPE Include in DataSet?
--------------------------------------------
1 A yes
1 A yes
1 B yes
1 B yes
1 B no
2 A yes
2 B yes
2 B yes
2 A no
2 B no
2 B no
DF = structure(list(GROUP = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 2L), TYPE = c("A", "A", "B", "B", "B", "A", "B", "B", "A",
"B", "B"), inc = c("yes", "yes", "yes", "yes", "no", "yes", "yes",
"yes", "no", "no", "no")), .Names = c("GROUP", "TYPE", "inc"), row.names = c(NA,
-11L), class = "data.frame")
Expected Output:
GROUP TYPE Include in DataSet?
--------------------------------------------
1 A yes
1 A yes
1 B yes
1 B yes
2 A yes
2 B yes
2 B yes
I have tried writing some code, but had no luck because I could not work out how to handle the grouping.
i=1
j=2
x <- allrows
for (i in x){
for(j in x){
if(i==j){
a$REMOVE=1
}
else{
a$REMOVE=2
}
}
}
You could do this by creating a new variable that identifies "double B" rows, then filtering out the rows that come after the first "double B" row in each group:
library(dplyr)
df %>%
group_by(GROUP) %>%
# Create new variable that tests whether both this row and the one above it have TYPE == 'B'
mutate(double_B = (TYPE == 'B' & lag(TYPE) == 'B')) %>%
# Find the first row with `double_B` in each group, filter out rows after it
filter(row_number() <= min(which(double_B == TRUE))) %>%
# Optionally, remove `double_B` column when done with it
select(-double_B)
# A tibble: 7 x 3
# Groups: GROUP [2]
GROUP TYPE IncludeinDataSet
<int> <chr> <chr>
1 1 A yes
2 1 A yes
3 1 B yes
4 1 B yes
5 2 A yes
6 2 B yes
7 2 B yes
As @Frank points out in the comments, you don't need to create the double_B variable: you can test for the "double B" condition directly in the which call inside filter:
df %>%
group_by(GROUP) %>%
# Find the first row with `double_B` in each group, filter out rows after it
filter(row_number() <= min(which(TYPE == 'B' & lag(TYPE) == 'B')))
Note that if a group contains no "double B" rows, which returns an empty vector, so min emits a warning and returns Inf. The filter still behaves properly in that case: every row of that group is kept.
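If the warning for groups without a "double B" is unwanted, one option is to count the "double B" rows seen so far instead of taking a minimum index. The following is only a minimal base R sketch of that idea, not part of the answer above: the helper name keep_until_double_b is hypothetical, and the data is the question's example without the inc column.

```r
# Data from the question (GROUP and TYPE only).
DF <- data.frame(
  GROUP = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L),
  TYPE  = c("A", "A", "B", "B", "B", "A", "B", "B", "A", "B", "B"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE for every row up to and including the first
# "double B" row of one group; all TRUE if the group has no "double B".
keep_until_double_b <- function(type) {
  prev <- c("", head(type, -1))                     # TYPE of the previous row
  is_double <- type == "B" & prev == "B"            # second B of a consecutive pair
  seen_before <- c(0, head(cumsum(is_double), -1))  # "double B" rows strictly above
  seen_before == 0
}

# Apply the helper within each GROUP and stitch the pieces back together.
pieces <- lapply(split(DF, DF$GROUP), function(g) g[keep_until_double_b(g$TYPE), ])
result <- do.call(rbind, pieces)
rownames(result) <- NULL
result
```

Because no min or which call is involved, a group without two consecutive "B" rows is kept whole and no warning is raised.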
This can be done by comparing the current value of 'TYPE' with the next value (via lead) to find the index of the first "double B" pair, then using seq_len to build the sequence from 1 to that index plus one, which subsets the rows inside slice:
library(dplyr)
df1 %>%
group_by(GROUP) %>%
slice(seq_len(which((TYPE == "B") & lead(TYPE) == "B")[1] + 1))
# A tibble: 7 x 3
# Groups: GROUP [2]
# GROUP TYPE IncludeInDataSet
# <int> <chr> <chr>
#1 1 A yes
#2 1 A yes
#3 1 B yes
#4 1 B yes
#5 2 A yes
#6 2 B yes
#7 2 B yes
df1 <- structure(list(GROUP = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 2L), TYPE = c("A", "A", "B", "B", "B", "A", "B", "B", "A",
"B", "B"), IncludeInDataSet = c("yes", "yes", "yes", "yes", "no",
"yes", "yes", "yes", "no", "no", "no")), class = "data.frame",
row.names = c(NA, -11L))
A different approach could be:
library(dplyr)
library(data.table)
df %>%
group_by(GROUP, rleid(TYPE)) %>%
mutate(temp = seq_along(TYPE)) %>%
ungroup() %>%
group_by(GROUP) %>%
filter(row_number() <= min(which(TYPE == "B" & temp == 2))) %>%
select(GROUP, TYPE, IncludeInDataSet)
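The temp column in the approach above is the position of each row within its run of identical consecutive TYPE values, so the first row with TYPE "B" and temp equal to 2 is the second "B" of the first pair. As a rough illustration of that idea for a single group, here is a base R sketch that uses rle in place of data.table's rleid. The helper name run_position is hypothetical, and the vector is GROUP 2's TYPE values from the question.

```r
# Hypothetical helper: position of each element within its run of identical
# consecutive values, e.g. c("A", "B", "B") gives 1 1 2.
run_position <- function(x) sequence(rle(x)$lengths)

# TYPE values of GROUP 2 from the question's example data.
type_g2 <- c("A", "B", "B", "A", "B", "B")

pos <- run_position(type_g2)                           # 1 1 2 1 1 2
first_double_b <- which(type_g2 == "B" & pos == 2)[1]  # index of the first second-in-a-row "B"
kept <- type_g2[seq_len(first_double_b)]               # rows up to and including that index
kept
```

This only handles one group at a time and assumes the group contains at least one "double B"; the dplyr pipeline above does the same per GROUP via group_by.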