Grouped recurrence by periods over a data.table

Tags: r, data.table

I have a dataset with names, dates, and several categorical columns. Let's say

data <- data.table(name = c('Anne', 'Ben', 'Cal', 'Anne', 'Ben', 'Cal', 'Anne', 'Ben', 'Ben', 'Ben', 'Cal'),
               period = c(1,1,1,1,1,1,2,2,2,3,3),
               category = c("A","A","A","B","B","B","A","B","A","A","B"))

Which looks like this:

  name  period  category
  Anne       1         A
   Ben       1         A
   Cal       1         A
  Anne       1         B
   Ben       1         B
   Cal       1         B
  Anne       2         A
   Ben       2         B
   Ben       2         A
   Ben       3         A
   Cal       3         B

I want to compute, for each period and each group of my categorical variables, how many of the names in that group were also present in the same group in the previous period. The output should be as follows:

period  category  recurrence_count
    2         A                 2   # due to Anne and Ben being on A, period 1
    2         B                 1   # due to Ben being on B, period 1
    3         A                 1   # due to Ben being on A, period 2 
    3         B                 0   # no match from B, period 2

I am aware of the .I and .GRP special symbols in data.table, but I have no idea how to express the notion of 'previous group' in the j entry of my statement. I imagine something like this might be a reasonable path, but I can't figure out the correct syntax:

data[, .(recurrence_count = length(intersect(name, name[last(.GRP)]))), by = .(category, period)]

pheymanss asked Mar 23 '21 22:03

1 Answer

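You can first summarize your data by category and period, storing the names of each group in a list column, and then record which period each group should be compared against.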

previous_period_names <- data[, .(names = list(name)), .(category, period)]

previous_period_names[, next_period := period + 1]
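
To see what this summary holds, here is a small self-contained sketch. It rebuilds the example data with the category values shown in the question's table, and `prev` is only an illustrative name for the summary.

```r
library(data.table)

# Example data, with categories matching the table printed in the question
data <- data.table(
  name     = c("Anne", "Ben", "Cal", "Anne", "Ben", "Cal", "Anne", "Ben", "Ben", "Ben", "Cal"),
  period   = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3),
  category = c("A", "A", "A", "B", "B", "B", "A", "B", "A", "A", "B")
)

# One row per (category, period); `names` is a list column with the names seen there
prev <- data[, .(names = list(name)), by = .(category, period)]

# The period in which these names count as "previous period" names
prev[, next_period := period + 1]

# Names recorded for category "A" in period 1
prev[category == "A" & period == 1, names[[1]]]
```

Each row of the summary therefore carries a whole vector of names, which is what makes the later `%in%` lookup possible.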

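Join your summary with your original data, matching on category as well as on period, so that each row receives the names from the previous period of its own category.

data[previous_period_names, names := i.names, on = c('category', 'period==next_period')]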

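Now count, for each period and category, how many names also appear in the joined list of previous-period names. Rows from the first period have no previous period, so they show up with a count of 0.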

data[, .(recurrence_count = sum(name %in% unlist(names))), by = .(period, category)]
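
Putting the steps together, the following is a minimal end-to-end sketch rather than a definitive implementation. It assumes the category values shown in the question's table; `prev`, `prev_names` and `result` are illustrative names (the list column is renamed so it does not share a name with base R's `names()`). The join matches on category as well as on period, and the first period is filtered out because it has nothing to recur from.

```r
library(data.table)

# Example data, with categories matching the table printed in the question
data <- data.table(
  name     = c("Anne", "Ben", "Cal", "Anne", "Ben", "Cal", "Anne", "Ben", "Ben", "Ben", "Cal"),
  period   = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3),
  category = c("A", "A", "A", "B", "B", "B", "A", "B", "A", "A", "B")
)

# Step 1: names per (category, period), tagged with the period that follows
prev <- data[, .(prev_names = list(name)), by = .(category, period)]
prev[, next_period := period + 1]

# Step 2: update join; each row gets the names of the previous period
# within its own category (rows without a match keep an empty entry)
data[prev, prev_names := i.prev_names, on = c("category", "period==next_period")]

# Step 3: count names that recur, skipping the first period;
# keyby sorts the result by period, then category
result <- data[period > min(period),
               .(recurrence_count = sum(name %in% unlist(prev_names))),
               keyby = .(period, category)]

print(result)
```

On this example data the counts for (2, A), (2, B), (3, A) and (3, B) come out as 2, 1, 1 and 0, which is the output requested in the question. Note that a name occurring more than once in the same period and category would be counted once per row.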

Telaroz answered Sep 18 '22 00:09