I have a data frame, puzzle, of customers and the type of item they own. A customer can appear multiple times in the list if they own several items.
name type
m1 A
m10 A
m2 A
m9 A
m9 B
m4 B
m5 B
m1 C
m2 C
m3 C
m4 C
m5 C
m6 C
m7 C
m8 C
m1 D
m5 D
I would like to calculate what percentage of people who own "A" also own "B", and so on.
Based on the above input, how can I get an output like this using R:
  A    B    C    D    TOTAL
A 1    0.25 0.5  0.25 4
B 0.33 1    0.67 0.33 3
C 0.25 0.25 1    0.25 8
D 0.5  0.5  1    1    2
Thanks a lot for your help!
Here is the long, manual way to do it, with no looping or advanced functions whatsoever (which of course wastes R's potential):
Example for item A:
puzzleA <- subset(puzzle, type == 'A')
puzzleB <- subset(puzzle, type == 'B')
Calculating the share of customers who own A who also own B:
length(unique(merge(puzzleA, puzzleB, by = 'name')$name)) / length(unique(puzzleA$name))
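The same pairwise calculation can be repeated for every combination of types without writing each subset by hand. The following is only a sketch of one way to do that in base R; the names owners and share are made up for illustration, and the data frame is rebuilt inline so the block runs on its own.

```r
# Same data as the `puzzle` data frame in the question
puzzle <- data.frame(
  name = c("m1", "m10", "m2", "m9", "m9", "m4", "m5", "m1", "m2",
           "m3", "m4", "m5", "m6", "m7", "m8", "m1", "m5"),
  type = c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C",
           "C", "C", "C", "C", "C", "D", "D"),
  stringsAsFactors = FALSE
)

# Unique owners of each item type (a named list of character vectors)
owners <- lapply(split(puzzle$name, puzzle$type), unique)

# share[i, j]: fraction of the owners of type i who also own type j.
# The outer sapply builds the columns, the inner one builds the rows.
share <- sapply(owners, function(b)
  sapply(owners, function(a) length(intersect(a, b)) / length(a)))

# Append the number of distinct owners per type as a TOTAL column
cbind(round(share, 2), TOTAL = lengths(owners))
```

This trades the merge for set intersections on the customer names, so it stays correct even if a customer were listed twice for the same type.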
Data
puzzle <- structure(list(name = c("m1", "m10", "m2", "m9", "m9", "m4",
"m5", "m1", "m2", "m3", "m4", "m5", "m6", "m7", "m8", "m1", "m5"
), type = c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C",
"C", "C", "C", "C", "C", "D", "D")), .Names = c("name", "type"
), class = "data.frame", row.names = c(NA, -17L))
You could also build a set of association rules, e.g.:
library(arules)
trans <- as(lapply(split(puzzle[2], puzzle[1]), unlist, F, F), "transactions")
rules <- apriori(trans, parameter = list(support=0, minlen=2, maxlen=2, conf=0))
res <- data.frame(
lhs = labels(lhs(rules)),
rhs = labels(rhs(rules)),
value = round(rules@quality$confidence, 2)
)
res <- reshape2::dcast(res, lhs~rhs, fill = 1)
res$total <- rowSums(trans@data)
res
# lhs {A} {B} {C} {D} total
# 1 {A} 1.00 0.25 0.50 0.25 4
# 2 {B} 0.33 1.00 0.67 0.33 3
# 3 {C} 0.25 0.25 1.00 0.25 8
# 4 {D} 0.50 0.50 1.00 1.00 2
We can do this with merge/table. We merge the dataset with itself by 'name', remove the first column, get the frequency count with table ('tbl'), divide it by the diagonal elements of 'tbl', and cbind with the diagonal elements.
tbl <- table(merge(puzzle, puzzle, by = "name")[-1])
cbind(round(tbl/diag(tbl),2), TOTAL= diag(tbl))
# A B C D TOTAL
#A 1.00 0.25 0.50 0.25 4
#B 0.33 1.00 0.67 0.33 3
#C 0.25 0.25 1.00 0.25 8
#D 0.50 0.50 1.00 1.00 2
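As a possible alternative to the self-merge, the same co-ownership counts can be obtained from a customer-by-type incidence table with crossprod. This is a sketch rather than part of the answer above; it assumes each (name, type) pair appears at most once, and the names counts and co are illustrative only.

```r
# Same data as the `puzzle` data frame in the question
puzzle <- data.frame(
  name = c("m1", "m10", "m2", "m9", "m9", "m4", "m5", "m1", "m2",
           "m3", "m4", "m5", "m6", "m7", "m8", "m1", "m5"),
  type = c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C",
           "C", "C", "C", "C", "C", "D", "D"),
  stringsAsFactors = FALSE
)

# Customers x types incidence matrix (0/1 if no pair is duplicated)
counts <- unclass(table(puzzle$name, puzzle$type))

# t(counts) %*% counts: entry [i, j] counts customers owning both i and j,
# and the diagonal holds the number of owners of each type
co <- crossprod(counts)

# Dividing by diag(co) recycles down the columns, so row i is divided by
# the number of owners of type i
res <- cbind(round(co / diag(co), 2), TOTAL = diag(co))
res
```

If the data could contain duplicated (name, type) rows, calling unique(puzzle) first would restore the 0/1 assumption.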