
Get percentage of people for each pair

Tags: r

I have a data frame, puzzle, of customers and the type of item they own. A customer can appear multiple times in the list if they own several items.

name    type
m1       A
m10      A
m2       A
m9       A
m9       B
m4       B
m5       B
m1       C
m2       C
m3       C
m4       C
m5       C
m6       C
m7       C
m8       C
m1       D
m5       D

I would like to calculate what percentage of the people who own "A" also own "B", and so on for every pair of types.

Based on the above input, how can I get an output like this using R:

    A     B      C      D      TOTAL
A   1     0.25   0.5    0.25    4
B   0.33  1      0.67   0.33    3
C   0.25  0.25   1      0.25    8
D   0.5   0.5    1      1       2

Thanks a lot for your help!


Here is the long and manual way to do it, with no looping or advanced functions whatsoever (but of course that is wasted potential in R):

Example for item A and item B:

puzzleA <- subset(puzzle, type == 'A')
puzzleB <- subset(puzzle, type == 'B')

Calculating the share of customers who own A who also own B:

length(unique(merge(puzzleA, puzzleB, by = 'name')$name)) / length(unique(puzzleA$name))
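The manual steps above generalise to any pair of types. A minimal sketch, assuming a hypothetical helper `share_owning` (not part of the original post) and using `intersect` on the customer names in place of `merge`:

```r
puzzle <- data.frame(
  name = c("m1", "m10", "m2", "m9", "m9", "m4", "m5", "m1", "m2",
           "m3", "m4", "m5", "m6", "m7", "m8", "m1", "m5"),
  type = c("A", "A", "A", "A", "B", "B", "B", "C", "C",
           "C", "C", "C", "C", "C", "C", "D", "D"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: share of the customers owning `from` who also own `to`
share_owning <- function(df, from, to) {
  owners_from <- unique(df$name[df$type == from])
  owners_to   <- unique(df$name[df$type == to])
  length(intersect(owners_from, owners_to)) / length(owners_from)
}

types <- sort(unique(puzzle$type))

# Rows: item already owned (`from`); columns: item also owned (`to`)
pct <- sapply(types, function(to) {
  sapply(types, function(from) share_owning(puzzle, from, to))
})

round(pct, 2)
```

The nested `sapply` still loops over every pair, so this is only a tidier form of the manual approach, not a vectorised one.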

Data

puzzle <- structure(list(name = c("m1", "m10", "m2", "m9", "m9", "m4", 
          "m5", "m1", "m2", "m3", "m4", "m5", "m6", "m7", "m8", "m1", "m5"
          ), type = c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C", 
          "C", "C", "C", "C", "C", "D", "D")), .Names = c("name", "type"
          ), class = "data.frame", row.names = c(NA, -17L))
Vishesh Kochher asked Dec 19 '22 13:12



2 Answers

You could also build a set of association rules, e.g.:

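library(arules)
# One transaction per customer, holding the item types that customer owns
trans <- as(lapply(split(puzzle[2], puzzle[1]), unlist, FALSE, FALSE), "transactions")
# All rules with exactly one item on each side; no support or confidence cutoff
rules <- apriori(trans, parameter = list(support = 0, minlen = 2, maxlen = 2, confidence = 0))
res <- data.frame(
  lhs = labels(lhs(rules)), 
  rhs = labels(rhs(rules)), 
  value = round(rules@quality$confidence, 2)
)
# Reshape to wide format; fill = 1 covers the diagonal, for which no rule exists
res <- reshape2::dcast(res, lhs ~ rhs, fill = 1)
# Number of customers owning each item type
res$total <- rowSums(trans@data)
res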
#   lhs  {A}  {B}  {C}  {D} total
# 1 {A} 1.00 0.25 0.50 0.25     4
# 2 {B} 0.33 1.00 0.67 0.33     3
# 3 {C} 0.25 0.25 1.00 0.25     8
# 4 {D} 0.50 0.50 1.00 1.00     2 
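The confidence of a rule {A} => {B} is the support of A and B together divided by the support of A, which is exactly the requested share. A minimal base R sketch that checks this for one pair; the variable names are illustrative and not part of arules:

```r
puzzle <- data.frame(
  name = c("m1", "m10", "m2", "m9", "m9", "m4", "m5", "m1", "m2",
           "m3", "m4", "m5", "m6", "m7", "m8", "m1", "m5"),
  type = c("A", "A", "A", "A", "B", "B", "B", "C", "C",
           "C", "C", "C", "C", "C", "C", "D", "D"),
  stringsAsFactors = FALSE
)

# Distinct customers per item type
owners <- lapply(split(puzzle$name, puzzle$type), unique)

# Every distinct customer counts as one transaction
n_customers <- length(unique(puzzle$name))

# Support: fraction of all customers owning the item(s)
support_A  <- length(owners$A) / n_customers
support_AB <- length(intersect(owners$A, owners$B)) / n_customers

# Confidence of {A} => {B}: share of A owners who also own B
confidence_A_B <- support_AB / support_A
confidence_A_B
```

With the sample data this gives 0.25, matching the {A} row and {B} column of the table above.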
lukeA answered Dec 30 '22 23:12



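We can do this with merge and table. Merging the dataset with itself by 'name' gives one row for every pair of types owned by the same customer. After dropping the 'name' column, table counts how often each pair occurs ('tbl'). Provided no (name, type) row is duplicated, the diagonal of 'tbl' holds the number of owners of each type. Dividing 'tbl' by its diagonal divides each row by that row's owner count, and cbind appends the diagonal as the TOTAL column.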

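# Pair counts: how many customers own both type.x (rows) and type.y (columns)
tbl <- table(merge(puzzle, puzzle, by = "name")[-1])
# diag(tbl) is recycled down the columns, so row i is divided by tbl[i, i]
cbind(round(tbl / diag(tbl), 2), TOTAL = diag(tbl))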
#     A    B    C    D TOTAL
#A 1.00 0.25 0.50 0.25     4
#B 0.33 1.00 0.67 0.33     3
#C 0.25 0.25 1.00 0.25     8
#D 0.50 0.50 1.00 1.00     2
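The same pair counts can be built without a self-merge. A minimal sketch, assuming each (name, type) pair should count once, that uses `crossprod` on a customer-by-type incidence matrix (a swapped-in technique, not the code above); the names `incidence`, `co` and `pct` are illustrative:

```r
puzzle <- data.frame(
  name = c("m1", "m10", "m2", "m9", "m9", "m4", "m5", "m1", "m2",
           "m3", "m4", "m5", "m6", "m7", "m8", "m1", "m5"),
  type = c("A", "A", "A", "A", "B", "B", "B", "C", "C",
           "C", "C", "C", "C", "C", "C", "D", "D"),
  stringsAsFactors = FALSE
)

# Customer-by-type 0/1 incidence matrix; unique() guards against duplicated rows
incidence <- unclass(table(unique(puzzle)))

# co[i, j] = number of customers owning both type i and type j
co <- crossprod(incidence)

# Divide each row by its diagonal entry (the number of owners of that type)
pct <- sweep(co, 1, diag(co), "/")

cbind(round(pct, 2), TOTAL = diag(co))
```

`sweep` makes the row-wise division explicit, where `tbl / diag(tbl)` relies on R recycling the diagonal down the columns.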
akrun answered Dec 31 '22 01:12
