I have a data frame and would like to keep the rows that match some condition, together with the N rows that follow each match. For example, consider a data frame with an hour and a minutes column (together representing a timestamp per row). Say I want the rows where the hour is 0 or 6, plus the row immediately after each of them, i.e. the first two records starting at the 0th and 6th hour. Is there a neat way to do this?
library(dplyr)
set.seed(3)
df <-
  data.frame(hour = 0:11, minutes = runif(12, 0, 59), count = rpois(12, 3)) %>%
  arrange(hour, minutes)
which produces
> df
hour minutes count
1 0 9.914450 3
2 1 47.643468 3
3 2 22.711599 5
4 3 19.336325 5
5 4 35.523940 1
6 5 35.659249 4
7 6 7.353373 5
8 7 17.381455 2
9 8 34.078985 2
10 9 37.227777 0
11 10 30.208938 1
12 11 29.796411 1
The normal filter returns two rows:
> df %>%
+ filter(hour%%6 == 0)
hour minutes count
1 0 9.914450 3
2 6 7.353373 5
However, the answer should be:
hour minutes count
1 0 9.914450 3
2 1 47.643468 3
3 6 7.353373 5
4 7 17.381455 2
In this case it is possible to use modulo arithmetic on the column used for filtering, but in the general case this may not be possible.
The original example is provided below; there I wanted the first two records in each even hour. In that case, Akrun's answer works well and exploits the group structure in the data. E.g.
library(dplyr)
set.seed(0)
df <-
data.frame(hour = rep(0:11, 3), minutes = runif(36, 0, 59), count = rpois(36, 3)) %>%
arrange(hour, minutes)
looks like:
hour minutes count
1 0 7.4077507 2
2 0 10.4168484 3
3 0 52.9051348 4
4 1 15.6650111 4
5 1 15.7660195 5
6 1 40.5343480 4
7 2 21.9553101 1
8 2 22.6621194 4
9 2 22.7807315 2
10 3 0.7900297 3
11 3 33.7983484 4
12 3 45.4206438 3
...
One could do
df %>% mutate(is_even_hour = ifelse(hour %% 2 == 0, 1, 0)) %>%
filter(is_even_hour == 1) %>%
group_by(hour, is_even_hour) %>%
filter(row_number() <= 2) %>%
ungroup %>%
select(-is_even_hour)
which gives
hour minutes count
<int> <dbl> <int>
1 0 7.407751 2
2 0 10.416848 3
3 2 21.955310 1
4 2 22.662119 4
5 4 22.560889 2
6 4 29.364255 5
7 6 20.080591 2
8 6 53.004991 3
9 8 35.374384 4
10 8 38.987070 3
11 10 3.645390 4
12 10 10.986838 5
One base R solution uses sapply. The idea is to find the row indices where hour is exactly divisible by 6 and then use seq to generate the indices of the rows to select, starting at each match. Since you want 2 rows per match here, length.out is 2; if you want more in future (as mentioned in the comments), change it to whatever number you need.
y <- which(df$hour%%6 == 0)
df[sapply(y, function(x) seq(x, length.out = 2)), ]
# hour minutes count
#1 0 9.914450 3
#2 1 47.643468 3
#7 6 7.353373 5
#8 7 17.381455 2
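The same index-expansion idea can be wrapped in a small helper that also copes with a match near the end of the data and with overlapping windows. This is only a sketch: the function name rows_with_following and its n_after parameter are made up for illustration, and the toy data frame stands in for the one in the question.

```r
# Hypothetical helper: given a logical vector `hit`, return the indices of
# every TRUE position plus the n_after positions that follow it.
rows_with_following <- function(hit, n_after) {
  starts <- which(hit)
  # Expand each start index into start, start + 1, ..., start + n_after
  idx <- rep(starts, each = n_after + 1L) + 0:n_after
  # Drop duplicates from overlapping windows and restore row order
  idx <- sort(unique(idx))
  # Discard indices that would run past the last row
  idx[idx <= length(hit)]
}

# Toy data standing in for the question's data frame
df <- data.frame(hour = 0:11, value = letters[1:12])

keep <- rows_with_following(df$hour %% 6 == 0, n_after = 1)
df[keep, ]
```

Because the helper only sees a logical vector, the condition can be anything that evaluates to TRUE/FALSE per row, not just a modulo test.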
After grouping by 'hour', we can do this in a single filter step
df %>%
group_by(hour) %>%
filter(!hour%%2 & row_number() <3)
# hour minutes count
# <int> <dbl> <int>
#1 0 7.407751 2
#2 0 10.416848 3
#3 2 21.955310 1
#4 2 22.662119 4
#5 4 22.560889 2
#6 4 29.364255 5
#7 6 20.080591 2
#8 6 53.004991 3
#9 8 35.374384 4
#10 8 38.987070 3
#11 10 3.645390 4
#12 10 10.986838 5
For the updated post
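i1 <- df %>%
  filter(hour %% 6 == 0) %>%
  .$hour %>%
  # Braces are needed: %>% binds tighter than +, so without them
  # only 0:1 would be piped into match()
  {rep(., each = 2) + 0:1} %>%
  match(df$hour)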
df[i1,]
# hour minutes count
#1 0 9.914450 3
#2 1 47.643468 3
#7 6 7.353373 5
#8 7 17.381455 2
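If the condition is not tied to consecutive values of a column, one possible alternative is to flag the matching rows and keep a row when either it or the row before it is flagged. The sketch below assumes dplyr's lag() with its default argument; the helper column name hit and the toy data are illustrative only.

```r
library(dplyr)

# Toy data standing in for the question's data frame
df <- data.frame(hour = 0:11, count = c(3, 3, 5, 5, 1, 4, 5, 2, 2, 0, 1, 1))

res <- df %>%
  # Flag the rows that satisfy the condition
  mutate(hit = hour %% 6 == 0) %>%
  # Keep a row if it is flagged or if the previous row was flagged;
  # default = FALSE handles the first row, which has no predecessor
  filter(hit | lag(hit, default = FALSE)) %>%
  select(-hit)

res
```

To keep more than one following row, further lag(hit, n = k, default = FALSE) terms could be OR-ed into the filter.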
Or this can be done in a compact way with data.table
library(data.table)
setDT(df)[df[, rep(which(!hour%%6), each = 2) + 0:1 ]]
# hour minutes count
#1: 0 9.914450 3
#2: 1 47.643468 3
#3: 6 7.353373 5
#4: 7 17.381455 2