Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

dplyr mutate function to evaluate values within columns (current, previous, next) vertically

Tags:

r

dplyr

I have scoured SO for a way to achieve what I need without luck so here it goes. A while back I discovered the package dplyr and its potential. I am thinking this package can do what I want, I just don't know how. This is a small subset of my data, but should be representative of my problem.

    dummy<-structure(list(time = structure(1:20, .Label = c("2015-03-25 12:24:00", 
    "2015-03-25 21:08:00", "2015-03-25 21:13:00", "2015-03-25 21:47:00", 
    "2015-03-26 03:08:00", "2015-04-01 20:30:00", "2015-04-01 20:34:00", 
    "2015-04-01 20:42:00", "2015-04-01 20:45:00", "2015-09-29 18:26:00", 
    "2015-09-29 19:11:00", "2015-09-29 21:21:00", "2015-09-29 22:03:00", 
    "2015-09-29 22:38:00", "2015-09-30 00:48:00", "2015-09-30 01:38:00", 
    "2015-09-30 01:41:00", "2015-09-30 01:45:00", "2015-09-30 01:47:00", 
    "2015-09-30 01:49:00"), class = "factor"), ID = c(1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L), station = c(1L, 1L, 1L, 2L, 3, 
    4L, 4L, 4L, 4L, 5L, 5L, 6L, 
    6L, 5, 5, 5L, 7, 7, 7L, 
    7)), .Names = c("time", "ID", "station"), class = "data.frame", row.names = c(NA, 
    -20L))

I wish to evaluate rows within the time column conditional on the ID and station column. Specifically, I would like the function (dplyr?) to evaluate each time row, and compare the time to the previous time (row-1) and next time (row+1). If the time of current row is within 1 hour of time of previous and/or next row, and the ID and station of current row match that of previous and/or next row, then I would like to add in a new row a 1, otherwise a 0.

How would I achieve this using dplyr?

The expected outcome should be like this:

                  time ID station new.value
1  2015-03-25 12:24:00  1       1         0
2  2015-03-25 21:08:00  1       1         1
3  2015-03-25 21:13:00  1       1         1
4  2015-03-25 21:47:00  1       2         0
5  2015-03-26 03:08:00  1       3         0
6  2015-04-01 20:30:00  1       4         1
7  2015-04-01 20:34:00  1       4         1
8  2015-04-01 20:42:00  1       4         1
9  2015-04-01 20:45:00  1       4         1
10 2015-09-29 18:26:00  2       5         1
11 2015-09-29 19:11:00  2       5         1
12 2015-09-29 21:21:00  2       6         1
13 2015-09-29 22:03:00  2       6         1
14 2015-09-29 22:38:00  2       5         0
15 2015-09-30 00:48:00  2       5         1
16 2015-09-30 01:38:00  2       5         1
17 2015-09-30 01:41:00  2       7         1
18 2015-09-30 01:45:00  2       7         1
19 2015-09-30 01:47:00  2       7         1
20 2015-09-30 01:49:00  2       7         1
like image 580
FlyingDutch Avatar asked Dec 11 '22 16:12

FlyingDutch


2 Answers

Here is an option using the difftime with dplyr mutate function. Firstly, we use a group_by operation to make sure the comparison is within each unique combination of ID and Station. The difftime can be used to calculate the difference time, here the units will be set as hours for convenience. The lag and lead functions are also from dplyr package which shift the selected column backward or forward. Combining with the vectorised operation of difftime, you can calculate the time difference between the current row and the previous/next row. We use abs to make sure the result is absolute value. The condition of <1 make sure the difference is within an hour. as.integer convert the logical values (T or F) to (1 or 0) correspondingly.

library(dplyr)
dummy %>% group_by(ID, station) %>% 
          mutate(new.value = as.integer(
                 abs(difftime(time, lag(time, default = Inf), units = "hours")) < 1 | 
                 abs(difftime(time, lead(time, default = Inf), units = "hours")) < 1))

Source: local data frame [20 x 4]
Groups: ID, station [7]

                  time    ID station new.value
                (time) (int)   (dbl)     (int)
1  2015-03-25 12:24:00     1       1         0
2  2015-03-25 21:08:00     1       1         1
3  2015-03-25 21:13:00     1       1         1
4  2015-03-25 21:47:00     1       2         0
5  2015-03-26 03:08:00     1       3         0
6  2015-04-01 20:30:00     1       4         1
7  2015-04-01 20:34:00     1       4         1
8  2015-04-01 20:42:00     1       4         1
9  2015-04-01 20:45:00     1       4         1
10 2015-09-29 18:26:00     2       5         1
11 2015-09-29 19:11:00     2       5         1
12 2015-09-29 21:21:00     2       6         1
13 2015-09-29 22:03:00     2       6         1
14 2015-09-29 22:38:00     2       5         0
15 2015-09-30 00:48:00     2       5         1
16 2015-09-30 01:38:00     2       5         1
17 2015-09-30 01:41:00     2       7         1
18 2015-09-30 01:45:00     2       7         1
19 2015-09-30 01:47:00     2       7         1
20 2015-09-30 01:49:00     2       7         1
like image 142
Psidom Avatar answered May 24 '23 05:05

Psidom


Psidom's answer is great -- here's a data.table approach.

library(data.table)
setDT(dummy)
# you do NOT want a factor for your time variable
dummy[, time := as.POSIXct(time) ]
dummy[, `:=`(lag_diff = c(Inf, diff(as.numeric(time))),
             lead_diff = c(diff(as.numeric(time)), Inf)),
      by = .(ID, station) ]
dummy[, new.value := as.integer(lag_diff < 3600 | lead_diff < 3600) ]
dummy
like image 37
C8H10N4O2 Avatar answered May 24 '23 04:05

C8H10N4O2