(For the following, I could either an R data.frame or R data.table. Both are ok.)
I have the following data.table:
library(data.table)
dt = data.table(V1=c("dog", "dog", "cat", "cat", "cat", "bird","bird","bird","bird"),
V2=rep(42, 9), V3=c(1, 2, 4, 5, 7, 1, 2, 5, 8))
> print(dt)
V1 V2 V3
1: dog 42 1
2: dog 42 2
3: cat 42 4
4: cat 42 5
5: cat 42 7
6: bird 42 1
7: bird 42 2
8: bird 42 5
9: bird 42 8
Column V3
contains integers from 1 to 8. My goal is to populate an 8 by 8 zero matrix with the count of each combination "pair" given the unique category in column V1
So, the combination pairs for dog
, cat
, and bird
are:
dog: (1, 2)
cat: (4, 5), (4, 7), (5, 7)
bird: (1, 2), (1, 5), (1, 8), (2, 5), (2, 8), (5, 8)
For each pair, I add +1
to the corresponding entry in the zero matrix. For this matrix, (n, m) = (m, n)
. The matrix given dt
would be:
1 2 3 4 5 6 7 8
1: 0 2 0 0 1 0 0 1
2: 2 0 0 0 1 0 0 1
3: 0 0 0 0 0 0 0 0
4: 0 0 0 0 1 0 1 0
5: 1 1 0 1 0 0 1 1
6: 0 0 0 0 0 0 0 0
7: 0 0 0 1 1 0 0 0
8: 1 1 0 0 1 0 0 0
Note that (1,2)=(2,1)
has a count 2, from the dog
combination and the bird
combination.
(1) Is there a method to calculate the combinations of values in an R data.table/data.frame column, given the unique value in another column?
Perhaps it would make sense to output an R list, with vector "pairs", e.g.
list(c(1, 2), c(2, 1), c(4, 5), c(4, 7), c(5, 7), c(5, 4), c(7, 4), c(7, 5),
c(1, 2), c(1, 5), c(1, 8), c(2, 5), c(2, 8), c(5, 8), c(2, 1), c(5, 1),
c(8, 1), c(5, 2), c(8, 2), c(8, 5))
However, I'm not sure how I would use this to populate a matrix...
(2) Given the input data.table/data.frame, what would be the most efficient data-structure to use to write out a matrix, as soon above?
Here's a data.table solution that seems to be efficient. We basically doing a self join in order to create combinations and then count. Then, similar to what @coldspeed done with Numpy, we will just update a zero matrix by locations with counts.
# a self join
tmp <- dt[dt,
.(V1, id = x.V3, id2 = V3),
on = .(V1, V3 < V3),
nomatch = 0L,
allow.cartesian = TRUE
][, .N, by = .(id, id2)]
## Create a zero matrix and update by locations
m <- array(0L, rep(max(dt$V3), 2L))
m[cbind(tmp$id, tmp$id2)] <- tmp$N
m + t(m)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
# [1,] 0 2 0 0 1 0 0 1
# [2,] 2 0 0 0 1 0 0 1
# [3,] 0 0 0 0 0 0 0 0
# [4,] 0 0 0 0 1 0 1 0
# [5,] 1 1 0 1 0 0 1 1
# [6,] 0 0 0 0 0 0 0 0
# [7,] 0 0 0 1 1 0 0 0
# [8,] 1 1 0 0 1 0 0 0
Alternatively, we could create tmp
using data.table::CJ
but that could be (potentially - thanks to @Frank for the tip) less memory efficient as it will create all possible combinations first, e.g.
tmp <- dt[, CJ(V3, V3)[V1 < V2], by = .(g = V1)][, .N, by = .(V1, V2)]
## Then, as previously
m <- array(0L, rep(max(dt$V3), 2L))
m[cbind(tmp$V1, tmp$V2)] <- tmp$N
m + t(m)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
# [1,] 0 2 0 0 1 0 0 1
# [2,] 2 0 0 0 1 0 0 1
# [3,] 0 0 0 0 0 0 0 0
# [4,] 0 0 0 0 1 0 1 0
# [5,] 1 1 0 1 0 0 1 1
# [6,] 0 0 0 0 0 0 0 0
# [7,] 0 0 0 1 1 0 0 0
# [8,] 1 1 0 0 1 0 0 0
Not sure this is the most elegant approach, but it works:
myfun <- function(x, matsize=8) {
# get all (i,j) pairs but in an unfortunate text format
pairs_all <- outer(x, x, paste)
# "drop" all self-pairs like (1,1)
diag(pairs_all) <- "0 0"
# convert these text-pairs into numeric pairs and store in matrix
ij <- do.call(rbind, lapply(strsplit(pairs_all, " "), as.numeric))
# create "empty" matrix of zeros
mat <- matrix(0, nrow=matsize, ncol=matsize)
# replace each spot of empty matrix with a 1 if that pair exists
mat[ij] <- 1
# return 0/1 matrix
return(mat)
}
# split your data by group
# lapply the custom function to each group
# add each group's 0/1 matrix together for final result
Reduce('+', lapply(split(dt$V3, dt$V1), myfun))
If anyone has a more direct way to implement the first 3 (non-comment) lines of myfun
, I would happily incorporate them.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With