Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Creating a Cross-frequency table in R or MySQL

Tags:

mysql

r

I have a table of user_id - category pairs. Users can fall into multiple categories. I'm am try to get counts cross-category for each possible outcome. i.e. number of users who were in category A and also category C, etc.

My raw data is structured like this:

example results

I would like results that look like this, showing counts cross-category:

example results

How can this be accomplished in R or MySQL? The data is quite large.

Here's are sample data:

data <- structure(list(category = structure(c(1L, 2L, 2L, 1L, 3L, 3L, 
2L, 1L, 3L, 2L, 2L, 2L, 3L, 1L, 1L, 3L), .Label = c("A", "B", 
"C"), class = "factor"), user_id = c(464L, 345L, 342L, 312L, 
345L, 234L, 423L, 464L, 756L, 756L, 345L, 345L, 464L, 345L, 234L, 
312L)), .Names = c("category", "user_id"), class = "data.frame", row.names = c(NA, 
-16L))

Any code snippets, thoughts on approach, functions, or package recommendations would be appreciated. Thank you! -John

like image 891
Super_John Avatar asked Oct 19 '22 09:10

Super_John


1 Answers

In R, I would approach this by first splitting up the data by user, computing all the unique pairs of categories for that user and then grouping together:

data$category <- as.character(data$category)
(combos <- do.call(rbind, tapply(data$category, data$user_id, function(x) {
  u <- unique(x)
  if (length(u) > 1) t(combn(u, 2))
  else NULL
})))
#      [,1] [,2]
# [1,] "C"  "A" 
# [2,] "A"  "C" 
# [3,] "B"  "C" 
# [4,] "B"  "A" 
# [5,] "C"  "A" 
# [6,] "A"  "C" 
# [7,] "C"  "B" 

The last step is to tabulate the pairs, which can be done with the table function in R. We'll actually use table twice to capture (a, b) and (b, a) for each pairing of categories a and b:

table(combos[,1], combos[,2]) + table(combos[,2], combos[,1])
#     A B C
#   A 0 1 4
#   B 1 0 2
#   C 4 2 0
like image 89
josliber Avatar answered Oct 22 '22 00:10

josliber