
How to calculate a table of pairwise counts from a long-form data frame

Tags: dataframe, r

I have a 'long-form' data frame with columns id (the primary key) and featureCode (categorical variable). Each record has between 1 and 9 values of the categorical variable. For example:

id  featureCode
5   PPLC
5   PCLI
6   PPLC
6   PCLI
7   PPL
7   PPLC
7   PCLI
8   PPLC
9   PPLC
10  PPLC

I'd like to calculate the number of times each feature code is used together with each of the other feature codes (the "pairwise counts" of the title). At this stage, the order in which the feature codes are used is not important. I envisage the result as another data frame, where the rows and columns are feature codes and the cells are counts. For example:

      PPLC  PCLI  PPL
PPLC  0     3     1
PCLI  3     0     1
PPL   1     1     0

Unfortunately, I don't know how to perform this calculation and I've drawn a blank when searching for advice (mostly, I suspect, because I don't know the correct terminology).

Iain Dillingham asked Dec 15 '22 17:12


1 Answer

Here is a data.table approach, similar to @mrdwab's.

It works best if featureCode is a character vector rather than a factor.
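The code in this answer refers to a data frame called dat, which the question does not define. As a minimal sketch, assuming dat simply holds the sample rows from the question, it could be rebuilt like this:

```r
# Assumption: `dat` is just the sample shown in the question.
dat <- read.table(text = "
id  featureCode
5   PPLC
5   PCLI
6   PPLC
6   PCLI
7   PPL
7   PPLC
7   PCLI
8   PPLC
9   PPLC
10  PPLC
", header = TRUE, stringsAsFactors = FALSE)

# featureCode is read as character, id as integer
str(dat)
```

Passing stringsAsFactors = FALSE keeps featureCode as character on older R versions too, where read.table would otherwise create a factor.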

library(data.table)

# `dat` is the long-form data frame from the question (columns id, featureCode)
DT <- as.data.table(dat)
# convert to character, in case featureCode is a factor
DT[, featureCode := as.character(featureCode)]
# keep only the ids that have more than one feature code
DT2 <- DT[, N := .N, by = id][N > 1]
# for each id, create all combinations of 2 feature codes,
# returned as a data.table with columns `V1` and `V2`,
# then count the rows in each (V1, V2) group
DT2[, rbindlist(combn(featureCode, 2,
      FUN = function(x) as.data.table(as.list(x)), simplify = FALSE)),
    by = id][, .N, by = list(V1, V2)]


     V1   V2 N
1: PPLC PCLI 3
2:  PPL PPLC 1
3:  PPL PCLI 1
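The output above lists each pair once, whereas the question asks for a square table with feature codes on both rows and columns. One possible way to get that shape is a base R cross-product of an id-by-featureCode incidence matrix. This is a different technique from the data.table code above, and the names dat, incidence and pairwise are assumptions for illustration:

```r
# Assumption: `dat` holds the sample rows from the question.
dat <- data.frame(
  id = c(5, 5, 6, 6, 7, 7, 7, 8, 9, 10),
  featureCode = c("PPLC", "PCLI", "PPLC", "PCLI", "PPL",
                  "PPLC", "PCLI", "PPLC", "PPLC", "PPLC"),
  stringsAsFactors = FALSE
)

# 0/1 matrix: one row per id, one column per feature code.
# The `> 0` guards against an id listing the same code twice.
incidence <- 1 * (unclass(table(dat$id, dat$featureCode)) > 0)

# crossprod(X) is t(X) %*% X, so cell [a, b] counts the ids
# that carry both feature code a and feature code b.
pairwise <- crossprod(incidence)

# The diagonal holds each code's own frequency; zero it out
# to match the layout requested in the question.
diag(pairwise) <- 0

# Reorder rows and columns to the order used in the question.
codes <- c("PPLC", "PCLI", "PPL")
pairwise[codes, codes]
```

For the sample data this gives 3 for PPLC with PCLI and 1 for each pair involving PPL. The result is a matrix; as.data.frame(pairwise) would convert it if a data frame is needed.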
mnel answered Jan 31 '23 01:01