
Count combinations of length 2 per id

Tags:

r

data.table

I have a largish data.table with two columns, id and var:

head(DT)
#    id var
# 1:  1   B
# 2:  1   C
# 3:  1   A
# 4:  1   C
# 5:  2   B
# 6:  2   C

I would like to create a kind of cross-table that shows, across ids, how many times each combination of two var values occurred in the data.

Expected output for the sample data:

out
#    A  B C
# A  0  3 3
# B NA  1 3
# C NA NA 0

Explanation:

  • the diagonal of the resulting matrix/data.frame/data.table counts the ids for which all vars that occurred were the same (all A, all B, or all C). In the sample data, id 4 has only one entry and that is B, so B - B is 1 in the desired result.
  • the upper triangle counts the number of ids in which two specific vars were both present, i.e. the combination A - B is present in 3 ids, as are the combinations A - C and B - C.
  • Note that for any id, a single combination of two vars can only contribute 0 (not present) or 1 (present), i.e. I don't want to count it multiple times per id.
  • the lower triangle of the result can be left NA, or 0, or it could have the same values as the upper triangle, but that would be redundant.

(The result could also be given in long-format as long as the relevant information is present.)
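The counting rule described above can be sketched in base R. This is only an illustration of the rule on the sample data, not an efficient solution for a large table; the names `sets`, `pairs_per_id`, `pairs`, `lev` and `out` are made up for this sketch.

```r
# Sample data from the question, rebuilt as a plain data.frame
DT <- data.frame(
  id  = rep(1:4, times = c(4, 9, 6, 1)),
  var = c("B", "C", "A", "C",
          "B", "C", "C", "A", "B", "B", "C", "C", "C",
          "C", "B", "C", "B", "A", "C",
          "B"),
  stringsAsFactors = FALSE
)

# One sorted set of distinct vars per id
sets <- lapply(split(DT$var, DT$id), function(v) sort(unique(v)))

# Each id contributes either its single var paired with itself,
# or every unordered pair of its distinct vars exactly once
pairs_per_id <- lapply(sets, function(s) {
  if (length(s) == 1L) matrix(c(s, s), ncol = 2) else t(combn(s, 2))
})
pairs <- do.call(rbind, pairs_per_id)

# Cross-tabulate the pairs over all var levels and blank the lower triangle
lev <- sort(unique(DT$var))
out <- table(factor(pairs[, 1], levels = lev),
             factor(pairs[, 2], levels = lev))
out[lower.tri(out)] <- NA
out
```

On the sample data this gives the same counts as the expected output: 3 for A - B, A - C and B - C, and 1 for B - B.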

I'm sure there's a clever (efficient) way of computing this, but I can't currently wrap my head around it.

Sample data:

DT <- structure(list(id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L), var = c("B", "C", "A", 
"C", "B", "C", "C", "A", "B", "B", "C", "C", "C", "C", "B", "C", 
"B", "A", "C", "B")), .Names = c("id", "var"), row.names = c(NA, 
-20L), class = "data.frame")

library(data.table)
setDT(DT, key = "id")
talat asked Apr 29 '16 15:04


1 Answer

Since you're ok with long-form results:

DT[, if(all(var == var[1]))
       .(var[1], var[1])
     else
       as.data.table(t(combn(sort(unique(var)), 2))), by = id][
   , .N, by = .(V1, V2)]
#   V1 V2 N
#1:  A  B 3
#2:  A  C 3
#3:  B  C 3
#4:  B  B 1
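To see what is being counted, it may help to run the grouped step on its own, before the final `.N` aggregation. The sketch below rebuilds the sample data as a data.table and stores the intermediate result under the made-up name `step1`; it assumes the data.table package is installed.

```r
library(data.table)

# Sample data from the question
DT <- data.table(
  id  = rep(1:4, times = c(4, 9, 6, 1)),
  var = c("B", "C", "A", "C",
          "B", "C", "C", "A", "B", "B", "C", "C", "C",
          "C", "B", "C", "B", "A", "C",
          "B")
)

# Per id: a single (var, var) row if only one distinct var occurs,
# otherwise one row per unordered pair of distinct vars
step1 <- DT[, if (all(var == var[1]))
              .(var[1], var[1])
            else
              as.data.table(t(combn(sort(unique(var)), 2))), by = id]
step1
```

Each id with several distinct vars contributes one row per pair, so counting rows by `(V1, V2)` afterwards counts each pair at most once per id.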

Or if we call the above output res:

dcast(res[CJ(c(V1,V2), c(V1,V2), unique = TRUE), on = c('V1', 'V2')][
          V1 == V2 & is.na(N), N := 0], V1 ~ V2, value.var = "N")
#   V1  A  B C
#1:  A  0  3 3
#2:  B NA  1 3
#3:  C NA NA 0

An alternative to combn is doing:

DT[, if (all(var == var[1]))
       .(var[1], var[1])
     else
       CJ(V1 = var, V2 = var, unique = TRUE)[V1 < V2], by = id][
   , .N, by = .(V1, V2)]
#    V1 V2 N
# 1:  A  B 3
# 2:  A  C 3
# 3:  B  C 3
# 4:  B  B 1

Or use combn with list output (instead of a matrix):

unique(DT, by=NULL)[ order(var), if(.N==1L)
       .(var, var)
     else
       transpose(combn(var, 2, simplify=FALSE)), by = id][
   , .N, by = .(V1, V2)]
eddi answered Sep 28 '22 21:09