I have a table of user_id - category pairs. Users can fall into multiple categories. I'm am try to get counts cross-category for each possible outcome. i.e. number of users who were in category A and also category C, etc.
My raw data is structured like this:
I would like results that look like this, showing counts cross-category:
How can this be accomplished in R or MySQL? The data is quite large.
Here's are sample data:
data <- structure(list(category = structure(c(1L, 2L, 2L, 1L, 3L, 3L,
2L, 1L, 3L, 2L, 2L, 2L, 3L, 1L, 1L, 3L), .Label = c("A", "B",
"C"), class = "factor"), user_id = c(464L, 345L, 342L, 312L,
345L, 234L, 423L, 464L, 756L, 756L, 345L, 345L, 464L, 345L, 234L,
312L)), .Names = c("category", "user_id"), class = "data.frame", row.names = c(NA,
-16L))
Any code snippets, thoughts on approach, functions, or package recommendations would be appreciated. Thank you! -John
In R, I would approach this by first splitting up the data by user, computing all the unique pairs of categories for that user and then grouping together:
data$category <- as.character(data$category)
(combos <- do.call(rbind, tapply(data$category, data$user_id, function(x) {
u <- unique(x)
if (length(u) > 1) t(combn(u, 2))
else NULL
})))
# [,1] [,2]
# [1,] "C" "A"
# [2,] "A" "C"
# [3,] "B" "C"
# [4,] "B" "A"
# [5,] "C" "A"
# [6,] "A" "C"
# [7,] "C" "B"
The last step is to tabulate the pairs, which can be done with the table
function in R. We'll actually use table
twice to capture (a, b) and (b, a) for each pairing of categories a and b:
table(combos[,1], combos[,2]) + table(combos[,2], combos[,1])
# A B C
# A 0 1 4
# B 1 0 2
# C 4 2 0
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With