I have a dataset with names, dates, and several categorical columns. Let's say
library(data.table)
data <- data.table(name = c('Anne', 'Ben', 'Cal', 'Anne', 'Ben', 'Cal', 'Anne', 'Ben', 'Ben', 'Ben', 'Cal'),
                   period = c(1,1,1,1,1,1,2,2,2,3,3),
                   category = c("A","A","A","B","B","B","A","B","A","A","B"))
Which looks like this:
name period category
Anne 1 A
Ben 1 A
Cal 1 A
Anne 1 B
Ben 1 B
Cal 1 B
Anne 2 A
Ben 2 B
Ben 2 A
Ben 3 A
Cal 3 B
I want to compute, for each period and each category, how many of the names in that group were also present in the previous period for the same category. The output should be as follows:
period category recurrence_count
2 A 2 # due to Anne and Ben being on A, period 1
2 B 1 # due to Ben being on B, period 1
3 A 1 # due to Ben being on A, period 2
3 B 0 # no match from B, period 2
I am aware of the .I and .GRP special symbols in data.table, but I have no idea how to express the notion of 'previous group' in the j entry of my statement. I imagine something like this might be a reasonable path, but I can't figure out the correct syntax:
data[, .(recurrence_count = length(intersect(name, name[last(.GRP)]))), by = .(category, period)]
You can first summarize your data by category and period, collecting the names in each group into a list column.
previous_period_names <- data[, .(names = list(name)), .(category, period)]
previous_period_names[, next_period := period + 1]
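Join your summary back onto your original data, matching on category as well as on the shifted period, so that each row picks up the names from the previous period of its own category.
data[previous_period_names, names := i.names, on = c('category', 'period==next_period')]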
Now count how many names in each group also appear in the names from the previous period. Groups in the first period have no previous period, so their count is 0.
data[, .(recurrence_count = sum(name %in% unlist(names))), by = .(period, category)]
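For reference, here is the whole approach put together as one script. It is a minimal sketch, assuming the data matches the table in the question (Ben in category A and Cal in category B for period 3). The `period > min(period)` filter, which drops the first period because it has nothing to compare against, and the `keyby` ordering are my additions and not strictly needed.

```r
library(data.table)

# Example data, matching the table shown in the question
data <- data.table(
  name     = c("Anne", "Ben", "Cal", "Anne", "Ben", "Cal", "Anne", "Ben", "Ben", "Ben", "Cal"),
  period   = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3),
  category = c("A", "A", "A", "B", "B", "B", "A", "B", "A", "A", "B")
)

# Names seen in each (category, period), tagged with the period they will be compared against
previous_period_names <- data[, .(names = list(name)), by = .(category, period)]
previous_period_names[, next_period := period + 1]

# Attach the previous period's names for the same category to every row
data[previous_period_names, names := i.names, on = .(category, period = next_period)]

# Count recurring names per group, skipping the first period (it has no predecessor)
result <- data[period > min(period),
               .(recurrence_count = sum(name %in% unlist(names))),
               keyby = .(period, category)]
print(result)
```

On the example data this should reproduce the four rows of the expected output in the question. Note that if the same name can appear more than once within a single period and category, `sum(name %in% ...)` counts every occurrence; wrapping `name` in `unique()` would count each name once.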