Filter window with dplyr: find matching row, and keep subsequent N rows

Tags: r, dplyr

I have a data frame and would like to keep the rows that match some condition, together with the rows that immediately follow each match, N rows per match in total. For example, consider a data frame that contains an hour column and a minutes column (representing a timestamp per row). Say I would like the first two records starting at the 0th and 6th hour, i.e. the matching row and the row after it. Is it possible to do this in a nice way?

library(dplyr)
set.seed(3)
df <- 
    data.frame(hour = 0:11, minutes = runif(12, 0, 59), count = rpois(12, 3)) %>%
    arrange(hour, minutes)

which produces

> df
   hour   minutes count
1     0  9.914450     3
2     1 47.643468     3
3     2 22.711599     5
4     3 19.336325     5
5     4 35.523940     1
6     5 35.659249     4
7     6  7.353373     5
8     7 17.381455     2
9     8 34.078985     2
10    9 37.227777     0
11   10 30.208938     1
12   11 29.796411     1

The normal filter returns two rows:

> df %>%
+     filter(hour%%6 == 0)
  hour  minutes count
1    0 9.914450     3
2    6 7.353373     5

However, the answer should be:

  hour   minutes count
1    0  9.914450     3
2    1 47.643468     3
3    6  7.353373     5
4    7 17.381455     2

In this case it is possible to use modulo arithmetic on the column used for filtering, but in the general case this may not be possible.

The original example is provided below; there I wanted the first two records in each even hour. In that case, akrun's answer works well because it exploits the group structure in the data. E.g.

library(dplyr)
set.seed(0)
df <- 
    data.frame(hour = rep(0:11, 3), minutes = runif(36, 0, 59), count = rpois(36, 3)) %>%
    arrange(hour, minutes)

looks like:

   hour    minutes count
1     0  7.4077507     2
2     0 10.4168484     3
3     0 52.9051348     4
4     1 15.6650111     4
5     1 15.7660195     5
6     1 40.5343480     4
7     2 21.9553101     1
8     2 22.6621194     4
9     2 22.7807315     2
10    3  0.7900297     3
11    3 33.7983484     4
12    3 45.4206438     3
...

One could do

df %>% mutate(is_even_hour = ifelse(hour %% 2 == 0, 1, 0)) %>%
    filter(is_even_hour == 1) %>%
    group_by(hour, is_even_hour) %>%
    filter(row_number() <= 2) %>%
    ungroup %>%
    select(-is_even_hour)

which gives

    hour   minutes count
   <int>     <dbl> <int>
1      0  7.407751     2
2      0 10.416848     3
3      2 21.955310     1
4      2 22.662119     4
5      4 22.560889     2
6      4 29.364255     5
7      6 20.080591     2
8      6 53.004991     3
9      8 35.374384     4
10     8 38.987070     3
11    10  3.645390     4
12    10 10.986838     5
Alex asked Mar 12 '23 02:03


2 Answers

I can think of a base R solution using sapply.

The idea is to find the positions of the rows whose hour is divisible by 6, and then use seq to generate the positions of the rows to select starting from each of them.

Since you want 2 rows from each position, length.out is 2; if you want more in future (as mentioned in the comments), you can change this to whatever number you need.

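# row positions where hour is divisible by 6
y <- which(df$hour %% 6 == 0)
# for each match, the positions of the match and of the row after it
df[sapply(y, function(x) seq(x, length.out = 2)), ]

#  hour   minutes count
#1    0  9.914450     3
#2    1 47.643468     3
#7    6  7.353373     5
#8    7 17.381455     2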
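
As a hedged extension of the idea above (not part of the original answer), the index generation can be wrapped in a small helper that also drops positions past the last row and removes duplicates when windows overlap. The function name `rows_from_match` and its arguments are made up for illustration, and the data frame is rebuilt here without dplyr so the sketch stands alone.

```r
# Hypothetical helper: keep every row where `cond` is TRUE plus the rows that
# follow it, `n` rows per match in total (the matching row counts as one).
rows_from_match <- function(data, cond, n = 2) {
  starts <- which(cond)  # row positions of the matches
  idx <- unlist(lapply(starts, function(s) seq(s, length.out = n)))
  idx <- unique(idx[idx <= nrow(data)])  # drop out-of-range and repeated positions
  data[idx, , drop = FALSE]
}

# Same example data as in the question (hour is already sorted)
set.seed(3)
df <- data.frame(hour = 0:11, minutes = runif(12, 0, 59), count = rpois(12, 3))

res <- rows_from_match(df, df$hour %% 6 == 0, n = 2)
res$hour
```

Because the condition is passed in as a logical vector, the same helper should work for conditions that cannot be expressed with modulo arithmetic, for example `rows_from_match(df, df$count == 0, n = 3)`.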
Ronak Shah answered Apr 26 '23 09:04


After grouping by 'hour', we can do this in a single filter step

df %>%
     group_by(hour) %>%
     filter(!hour%%2 & row_number() <3)
#     hour   minutes count
#    <int>     <dbl> <int>
#1      0  7.407751     2
#2      0 10.416848     3
#3      2 21.955310     1
#4      2 22.662119     4
#5      4 22.560889     2
#6      4 29.364255     5
#7      6 20.080591     2
#8      6 53.004991     3
#9      8 35.374384     4
#10     8 38.987070     3
#11    10  3.645390     4
#12    10 10.986838     5

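For the updated post, we can take the hours that match, add 0:1 to get each matching hour and the hour after it, and match those back to row positions. This assumes the hour following each match is present in df$hour.

hrs <- df %>% 
          filter(hour %% 6 == 0) %>%
          .$hour
# add the offsets before calling match(): %>% binds more tightly than +,
# so the addition is kept outside the pipe
i1 <- match(rep(hrs, each = 2) + 0:1, df$hour)
df[i1,]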
#  hour   minutes count
#1    0  9.914450     3
#2    1 47.643468     3
#7    6  7.353373     5
#8    7 17.381455     2
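
For a condition that does not map onto neighbouring values of a column, one possible dplyr-only sketch (an addition here, not part of the original answer) is to start a new window at every matching row with `cumsum()` and keep the first rows of each window. The names `n_keep`, `window_id` and `res_dplyr` are assumptions chosen for illustration.

```r
library(dplyr)

# Same example data as in the updated question
set.seed(3)
df <- data.frame(hour = 0:11, minutes = runif(12, 0, 59), count = rpois(12, 3))

n_keep <- 2  # rows to keep per match, including the matching row itself

res_dplyr <- df %>%
  mutate(window_id = cumsum(hour %% 6 == 0)) %>%  # counter increases at every match
  filter(window_id > 0) %>%                        # drop rows before the first match
  group_by(window_id) %>%
  filter(row_number() <= n_keep) %>%               # first n_keep rows of each window
  ungroup() %>%
  select(-window_id)

res_dplyr$hour
```

Only the logical expression inside `cumsum()` needs to change for a different condition, and a window that runs into the end of the data is simply shorter rather than producing NA rows.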

Or this can be done in a compact way with data.table

library(data.table)
setDT(df)[df[, rep(which(!hour%%6), each = 2) + 0:1 ]]
#   hour   minutes count
#1:    0  9.914450     3
#2:    1 47.643468     3
#3:    6  7.353373     5
#4:    7 17.381455     2
akrun answered Apr 26 '23 08:04