I have a 'long-form' data frame with columns id
(the primary key) and featureCode
(categorical variable). Each record has between 1 and 9 values of the categorical variable. For example:
id featureCode
5 PPLC
5 PCLI
6 PPLC
6 PCLI
7 PPL
7 PPLC
7 PCLI
8 PPLC
9 PPLC
10 PPLC
I'd like to calculate the number of times each feature code is used with the other feature codes (the "pairwise counts" of the title). At this stage, the order each feature code is used is not important. I envisage the result would be another data frame, where the rows and columns are feature codes, and the cells are counts. For example:
PPLC PCLI PPL
PPLC 0 3 1
PCLI 3 0 1
PPL 1 1 0
Unfortunately, I don't know how to perform this calculation and I've drawn a blank when searching for advice (mostly, I suspect, because I don't know the correct terminology).
Here is a data.table
approach similar to @mrdwab
It will work best if featureCode
is a character
library(data.table)
DT <- data.table(dat)
# convert to character
DT[, featureCode := as.character(featureCode)]
# subset those with >1 per id
DT2 <- DT[, N := .N, by = id][N>1]
# create all combinations of 2
# return as a data.table with these as columns `V1` and `V2`
# then count the numbers in each group
DT2[, rbindlist(combn(featureCode,2,
FUN = function(x) as.data.table(as.list(x)), simplify = F)),
by = id][, .N, by = list(V1,V2)]
V1 V2 N
1: PPLC PCLI 3
2: PPL PPLC 1
3: PPL PCLI 1
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With