R: Using dplyr to count the number of occurrences 1 hour ahead

Tags: r, dplyr, posixct

I am trying to find a way to use dplyr to count, for each row, the number of occurrences of the same id within the hour starting at that row's time. I tried using a for loop, but it doesn't give me the desired result. I searched Stack Overflow and tried various methods, but to no avail. Any advice or help is greatly appreciated. Thanks.

Dataset: https://drive.google.com/file/d/1U186SeBWYyTnJVgUPmow7yknr6K9vu8i/view?usp=sharing

  id           date_time count
1  1 2019-12-27 00:00:00    NA
2  2 2019-12-27 00:00:00    NA
3  2 2019-12-27 00:55:00    NA
4  2 2019-12-27 01:00:00    NA
5  2 2019-12-28 01:00:00    NA
6  3 2019-12-27 22:00:00    NA
7  3 2019-12-27 22:31:00    NA
8  3 2019-12-28 14:32:00    NA

Desired Output

  id           date_time count
1  1 2019-12-27 00:00:00    1     #Count = 1 since there are no other cases 1 hour ahead besides itself; only 1 case of id=1
2  2 2019-12-27 00:00:00    3     #Count = 3 as there are 3 cases from 00:00 to 01:00 on 27/12
3  2 2019-12-27 00:55:00    2     #Count = 2 as there are 2 cases from 00:55 to 01:55 on 27/12
4  2 2019-12-27 01:00:00    1     #Count = 1 as only itself from 01:00 to 02:00 on 27/12
5  2 2019-12-28 01:00:00    1     #Count = 1 as only itself from 01:00 to 02:00 on 28/12
6  3 2019-12-27 22:00:00    2
7  3 2019-12-27 22:31:00    1
8  3 2019-12-28 14:32:00    1

My code (I'm stuck):

library(tidyverse)

data <- read.csv('test.csv')
data$date_time <- as.POSIXct(data$date_time)
data$count <- NA

data %>% 
  group_by(id) %>%
  arrange(date_time, .by_group=TRUE)

#Doesn't give the desired output
for (i in 1:nrow(data)){
  data$count[i] <- nrow(data[data$date_time<=data$date_time[i]+1*60*60 & data$date_time>=data$date_time[i],])
}


asked Jun 13 '20 17:06

Alexander Peterson

3 Answers

If the OP is only looking for a tidyverse solution, I am happy to delete this.

Here is an approach using data.table non-equi join:

DT[, onehrlater := date_time + 60*60] 
DT[, count :=
  DT[DT, on=.(id, date_time>=date_time, date_time<=onehrlater),
    by=.EACHI, .N]$N
]

How to read this:

1) DT[, onehrlater := date_time + 60*60] creates a new POSIXct column holding the time one hour after date_time. := updates the original dataset by reference.

2) DT[DT, on=.(id, date_time>=date_time, date_time<=onehrlater), ...] performs a non-equi self join such that all rows with i) the same id, ii) a date_time at or after this row's date_time, and iii) a date_time at or before this row's onehrlater are joined to this row.

3) by=.EACHI, .N returns the number of matched rows for each row of the inner DT. $N extracts that count from the result of the non-equi self join, and DT[, count := ...] adds it to the original dataset by reference.
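The three join conditions can be cross-checked with a plain base R loop. This is only a minimal sketch on the eight sample rows from the question, not part of the data.table solution; the name df and the tz = "UTC" setting are assumptions made to keep it reproducible. Unlike the loop in the question, it also compares id, which is the condition that loop leaves out.

```r
# Sample rows copied from the question (tz = "UTC" is an assumption)
df <- data.frame(
  id = c(1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L),
  date_time = as.POSIXct(
    c("2019-12-27 00:00:00", "2019-12-27 00:00:00", "2019-12-27 00:55:00",
      "2019-12-27 01:00:00", "2019-12-28 01:00:00", "2019-12-27 22:00:00",
      "2019-12-27 22:31:00", "2019-12-28 14:32:00"),
    tz = "UTC"
  )
)

# For each row i, count the rows that have the same id and a date_time
# inside the closed window [date_time[i], date_time[i] + 1 hour]
df$count <- vapply(seq_len(nrow(df)), function(i) {
  sum(df$id == df$id[i] &
        df$date_time >= df$date_time[i] &
        df$date_time <= df$date_time[i] + 60 * 60)
}, integer(1))

print(df)
```

The loop compares every row with every other row, so it is only practical for small data; the non-equi join is the better choice for larger tables.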

output:

   id           date_time          onehrlater count
1:  1 2019-12-27 00:00:00 2019-12-27 01:00:00     1
2:  2 2019-12-27 00:00:00 2019-12-27 01:00:00     3
3:  2 2019-12-27 00:55:00 2019-12-27 01:55:00     2
4:  2 2019-12-27 01:00:00 2019-12-27 02:00:00     1
5:  2 2019-12-28 01:00:00 2019-12-28 02:00:00     1
6:  3 2019-12-27 22:00:00 2019-12-27 23:00:00     2
7:  3 2019-12-27 22:31:00 2019-12-27 23:31:00     1
8:  3 2019-12-28 14:32:00 2019-12-28 15:32:00     1

data:

library(data.table)
DT <- fread("id           date_time 
1 2019-12-27T00:00:00
2 2019-12-27T00:00:00
2 2019-12-27T00:55:00
2 2019-12-27T01:00:00
2 2019-12-28T01:00:00
3 2019-12-27T22:00:00
3 2019-12-27T22:31:00
3 2019-12-28T14:32:00")
DT[, date_time := as.POSIXct(date_time, format="%Y-%m-%dT%T")]


answered Oct 06 '22 16:10

chinsoon12

The question can be solved using a non-equi self join (in data.table speak). Unfortunately, as far as I know, this was not available in dplyr at the time of writing.

Here is an implementation using SQL:

library(sqldf)
sqldf("
select d1.id, d1.date_time, count(d2.date_time) as count 
  from dat as d1, dat as d2
  where d1.id = d2.id and d2.date_time between d1.date_time and (d1.date_time + 60*60)
  group by d1.id, d1.date_time")

  id           date_time count
1  1 2019-12-27 00:00:00     1
2  2 2019-12-27 00:00:00     3
3  2 2019-12-27 00:55:00     2
4  2 2019-12-27 01:00:00     1
5  2 2019-12-28 01:00:00     1
6  3 2019-12-27 22:00:00     2
7  3 2019-12-27 22:31:00     1
8  3 2019-12-28 14:32:00     1

Data

# reading directly from google drive, see https://stackoverflow.com/a/33142446/3817004
dat <- data.table::fread(
  "https://drive.google.com/uc?id=1U186SeBWYyTnJVgUPmow7yknr6K9vu8i&export=download")[
    , date_time := anytime::anytime(date_time)]

answered Oct 06 '22 15:10

Uwe

fuzzyjoin might be helpful here. You can create a time range for each row of data (setting end_time to 3600 seconds, or 1 hour, after each date_time). Then, you can do a fuzzy join of the data with itself, where a row is counted as within the hour if it has the same id and its date_time falls within this range.

library(tidyverse)
library(fuzzyjoin)

df %>%
  mutate(row_id = row_number(),
         end_time = date_time + 3600) %>%
  fuzzy_inner_join(df, 
                  by = c("id", "date_time" = "date_time", "end_time" = "date_time"), 
                  match_fun = list(`==`, `<=`, `>=`)) %>%
  group_by(row_id) %>%
  summarise(id = first(id.x),
            date_time = first(date_time.x),
            count = n())

Output

# A tibble: 8 x 4
  row_id    id date_time           count
   <int> <int> <dttm>              <int>
1      1     1 2019-12-27 00:00:00     1
2      2     2 2019-12-27 00:00:00     3
3      3     2 2019-12-27 00:55:00     2
4      4     2 2019-12-27 01:00:00     1
5      5     2 2019-12-28 01:00:00     1
6      6     3 2019-12-27 22:00:00     2
7      7     3 2019-12-27 22:31:00     1
8      8     3 2019-12-28 14:32:00     1
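The same range join can also be sketched with dplyr alone, assuming dplyr 1.1.0 or later, where join_by() accepts inequality conditions. This is a hedged variant rather than part of the fuzzyjoin approach above; the sample data frame df and the tz = "UTC" setting are assumptions made to keep the example self-contained.

```r
library(dplyr)

# Sample rows copied from the question (tz = "UTC" is an assumption)
df <- data.frame(
  id = c(1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L),
  date_time = as.POSIXct(
    c("2019-12-27 00:00:00", "2019-12-27 00:00:00", "2019-12-27 00:55:00",
      "2019-12-27 01:00:00", "2019-12-28 01:00:00", "2019-12-27 22:00:00",
      "2019-12-27 22:31:00", "2019-12-28 14:32:00"),
    tz = "UTC"
  )
)

result <- df %>%
  mutate(row_id = row_number(),
         end_time = date_time + 3600) %>%
  # In join_by(), the left side of each condition refers to x and the
  # right side to y: same id, and x's date_time <= y's date_time <= x's end_time
  inner_join(df,
             by = join_by(id, date_time <= date_time, end_time >= date_time)) %>%
  # Inequality joins keep both key columns, so x's time is date_time.x
  count(row_id, id, date_time = date_time.x, name = "count")

print(result)
```

Grouping by row_id keeps rows with identical id and date_time apart, as in the fuzzyjoin version.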

answered Oct 06 '22 15:10

Ben
