First of all, please let me know if what I'm doing is poor use of dplyr because I'm not sure if I'm approaching this in the best way. I have the following dataframe:
mydf = data.frame(user = c(7,7,7,7,7,7,7,8,8,8,8,8,8),
col1 = c('0','0','1','1','0','3','NULL','3','3','0','1','0','0'),
col2 = runif(n=13),
col3 = letters[1:13],
stringsAsFactors = FALSE)
> mydf
user col1 col2 col3
1 7 0 0.7607907 a
2 7 0 0.1580448 b
3 7 1 0.8063540 c
4 7 1 0.7331512 d
5 7 0 0.2433631 e
6 7 3 0.2357065 f
7 7 NULL 0.4864172 g
8 8 3 0.6806089 h
9 8 3 0.2229874 i
10 8 0 0.6187911 j
11 8 1 0.7617177 k
12 8 0 0.5884821 l
13 8 0 0.4985750 m
The filtering I would like to do is a bit wordy, but I'll try - I would like to filter the dataframe by removing all rows where col1 == '0' if that row occurs AFTER the first row for that user where col1 == '1'. (bolded indicates i messed up the original question, and switched the 0 and 1).
For example, for user 7, the 3rd row has col1 == '1', so I'd like to filter all rows after row 3 where col1 == '0' (in this case, only row 5). Then, for user 8, the 11th row is the first row for this user where col1 == '1', so I'd then like to filter rows 12 and 13, since col1 == '0'.
My final output should be something like this:
> mydf
user col1 col2 col3
1 7 0 0.7607907 a
2 7 0 0.1580448 b
3 7 1 0.8063540 c
4 7 1 0.7331512 d
6 7 3 0.2357065 f
7 7 NULL 0.4864172 g
8 8 3 0.6806089 h
9 8 3 0.2229874 i
10 8 0 0.6187911 j
11 8 1 0.7617177 k
I've tried the following, but it didn't work. I thought adding a rownums column, then grouping by user, and then filtering how I've described would work. My thoughts are that there is something wrong with my filter call:
mydf %>%
mutate(rownums = 1:nrow(mydf)) %>%
group_by(user) %>%
filter(!(col1 == "0" & rownums > min(which(col1 == "1"))))
# A tibble: 9 x 5
# Groups: col0 [2]
user col1 col2 col3 rownums
<dbl> <chr> <dbl> <chr> <int>
1 7 0 0.2088034 a 1
2 7 0 0.2081894 b 2
3 7 1 0.1825428 c 3
4 7 1 0.2143353 d 4
5 7 3 0.1979774 f 6
6 7 NULL 0.2990799 g 7
7 8 3 0.7808038 h 8
8 8 3 0.1694272 i 9
9 8 1 0.1526450 k 11
The difference between this output, and the correct output, is that this output incorrectly also filtered row 10 of the original dataframe.
Any help with this is appreciated!
EDIT - I'm particularly curious if group_by() %>% filter() is bad practice in R with regards to dplyr. 99% of my group_by() are followed by summarize(), which obviously makes more sense.
EDIT2 - I think I've got it!
mydf %>%
group_by(col0) %>%
mutate(rownums = 1:length(col0)) %>%
filter(!(col1 == "0" & rownums > min(which(col1 == "1"))))
Simply flipping the order of the mutate() and group_by() calls, and tweaking the mutate() call a bit, looks to have gotten it done. I'm open to hearing a better approach though.
group_by() takes an existing tbl and converts it into a grouped tbl where operations are performed "by group". ungroup() removes grouping. The filter() function is used to subset a data frame, retaining all rows that satisfy your conditions. To be retained, the row must produce a value of TRUE for all conditions.
By using group_by() function from dplyr package we can perform group by on multiple columns or variables (two or more columns) and summarise on multiple columns for aggregations.
How to subset the data frame (DataFrame) by column value and name in R? By using R base df[] notation, or subset() you can easily subset the R Data Frame (data. frame) by column value or by column name.
Use inbuilt data sets or create a new data set and look at top few rows in the data set. Then, look at the bottom few rows in the data set. Check the data structure. Filter the data by categorical column using split function.
There is a cumany
function, which is useful for these sequential conditions, like this:
mydf %>%
group_by(user) %>%
mutate(seen_one = cumany(col1 == "1")) %>%
filter(!seen_one | col1 != "0")
That is mark all rows after "1"
has been in the 'stream' with seen_one
, and then keep the rows where one of the conditions is not met. (Semantics of filter
require to invert the condition to 'get rid' of the rows, !(A & B) == !A | !B
.)
Could be solved by updating your attempt a little :
library(dplyr)
mydf %>%
group_by(user) %>%
filter(col1 != 0 | row_number() < which.max(col1 == 1))
# user col1 col2 col3
# <dbl> <chr> <dbl> <chr>
# 1 7 0 0.756522673 a
# 2 7 0 0.168314555 b
# 3 7 1 0.977254798 c
# 4 7 1 0.722721694 d
# 5 7 3 0.407849378 f
# 6 7 NULL 0.245335151 g
# 7 8 3 0.003423735 h
# 8 8 3 0.191716738 i
# 9 8 0 0.626846893 j
#10 8 1 0.546459621 k
Using filter
we select all the rows where col1
is not equal to 0 or those rows where the current row is less than the index of first occurrence of 1 for that group.
Here is an idea via dplyr
library(dplyr)
df %>%
group_by(user) %>%
mutate(id1 = row_number(), new_col = max(which(col1 == 1)+1)) %>%
filter(!(col1 == 0 & id1 >= new_col))
which gives,
# A tibble: 10 x 6 # Groups: user [2] user col1 col2 col3 id1 new_col <dbl> <chr> <dbl> <chr> <int> <dbl> 1 7 0 0.54742608 a 1 5 2 7 0 0.89271859 b 2 5 3 7 1 0.48999057 c 3 5 4 7 1 0.17163211 d 4 5 5 7 3 0.96146770 f 6 5 6 7 NULL 0.31368382 g 7 5 7 8 3 0.82051455 h 1 5 8 8 3 0.30705440 i 2 5 9 8 0 0.18545358 j 3 5 10 8 1 0.04834678 k 4 5
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With