I have data on customers and the different products they have purchased:
Customer  Product
1         A
1         B
1         C
2         D
2         E
2         F
3         A
3         B
3         D
4         A
4         B
I would like to check which sets of products occur together across different customers, and to get the count for product combinations of different lengths. For example, the product combination A and B together occurs for three different customers; the product group A, B and C occurs for one customer. And so on for all different sets of 2 or more products in the data. Something like:
Product Group  Number
A, B, C        1
D, E, F        1
A, B, D        1
A, B           3
Thus, I'm counting the A, B combination both in customers who have only products A and B (e.g. customer 4) and in customers who have A and B plus any other products (e.g. customer 1, who has A, B and C).
Does anyone have any ideas how to do that with either a tidyverse or base R approach? I feel like it ought to be pretty trivial - maybe pivot_wider first, then count?
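For concreteness, the pivot_wider step I have in mind might look something like the sketch below (untested; the counting part is what I'm missing):

library(tidyr)

# one row per customer, one 0/1 column per product
wide <- pivot_wider(dat, id_cols = Customer, names_from = Product,
                    values_from = Product, values_fn = length, values_fill = 0)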
I have found this question and answer that can do what I need for pairs of products, but I need to count combinations of more than two products as well.
If you can use a non-base package, there is a tool dedicated to the task of finding item sets: arules::apriori. It is much faster on larger data sets.
library(arules)
# coerce data frame to binary incidence matrix
# use apriori to get "frequent itemsets"
r = apriori(data = as.matrix(table(dat) > 0),
            # set: type of association mined, minimal support needed of an item set,
            # minimal number of items per item set
            par = list(target = "frequent itemsets",
                       support = 0,
                       minlen = 2))
# coerce itemset to data.frame, select relevant rows and columns
d = as(r, "data.frame")
d[d$count > 0, c("items", "count")]
#      items count
# 4    {B,C}     1
# 5    {A,C}     1
# 6    {E,F}     1
# 7    {D,E}     1
# 10   {D,F}     1
# 13   {B,D}     1
# 14   {A,D}     1
# 15   {A,B}     3
# 25 {A,B,C}     1
# 26 {D,E,F}     1
# 35 {A,B,D}     1
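For illustration, the binary incidence matrix that apriori consumes should look like this on the toy data: one row per customer, TRUE where the customer bought the product.

as.matrix(table(dat) > 0)
#         Product
# Customer     A     B     C     D     E     F
#        1  TRUE  TRUE  TRUE FALSE FALSE FALSE
#        2 FALSE FALSE FALSE  TRUE  TRUE  TRUE
#        3  TRUE  TRUE FALSE  TRUE FALSE FALSE
#        4  TRUE  TRUE FALSE FALSE FALSE FALSE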
Timing on a larger data set: 10000 customers with up to 6 products each. apriori is quite a lot faster.
# Unit: milliseconds
#              expr        min         lq       mean     median         uq        max neval
#     f_henrik(dat)   38.95475    39.8621   41.44454   40.67313   41.05565   57.64655    20
#      f_allan(dat) 4578.20595  4622.2363 4664.57187 4654.58713 4679.78119 4924.22537    20
#        f_jay(dat) 2799.10516  2939.9727 2995.90038 2971.24127 2999.82019 3444.70819    20
#     f_uwe_dt(dat) 2943.26219  3007.1212 3028.37550 3027.46511 3060.38380 3076.25664    20
#  f_uwe_dplyr(dat) 6339.03141  6375.7727 6478.77979 6448.56399 6521.54196 6816.09911    20
10000 customers with up to 10 products each. apriori is several hundred times faster.
# Unit: milliseconds
#              expr         min          lq        mean      median          uq         max neval
#     f_henrik(dat)    58.40093    58.95241    59.71129    59.63988    60.43591    61.21082    20
#        f_jay(dat) 52824.67760 53369.78899 53760.43652 53555.69881 54049.91600 55605.47980    20
#     f_uwe_dt(dat) 22612.87954 22820.12012 22998.85072 22974.32710 23220.00390 23337.22815    20
#  f_uwe_dplyr(dat) 26083.20240 26255.88861 26445.49295 26402.67887 26659.81195 27046.83491    20
On the larger data set, Allan's code gave warnings (In rawToBits(as.raw(x)) : out-of-range values treated as 0 in coercion to raw) that did not appear on the toy data, and which seemed to affect the result. Thus, it is not included in the second benchmark.
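The warning stems from as.raw() holding only a single byte: subset indices above 255, i.e. any customer with more than 8 products, collapse to zero. A minimal illustration:

# as.raw() truncates values above 255, so the bit-mask trick in
# all_multiples() cannot address all 2^n subsets once n > 8
rawToBits(as.raw(256))[1:8]
# all eight bits come back as 00, with the warning
# "out-of-range values treated as 0 in coercion to raw"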
Data and benchmark code:
set.seed(3)
n_cust = 10000
n_product = sample(2:6, n_cust, replace = TRUE) # 2:10 in second run
dat = data.frame(
  Customer = rep(1:n_cust, n_product),
  Product = unlist(lapply(n_product, function(n) sample(letters[1:6], n))))  # letters[1:10] in second run
library(microbenchmark)
res = microbenchmark(f_henrik(dat),
                     f_allan(dat),
                     f_jay(dat),
                     f_uwe_dt(dat),
                     f_uwe_dplyr(dat),
                     times = 20L)
Check for equality:
henrik = f_henrik(dat)
allan = f_allan(dat)
jay = f_jay(dat)
uwe_dt = f_uwe_dt(dat)
uwe_dplyr = f_uwe_dplyr(dat)
# change outputs to common format for comparison
# e.g. string format, column names, order
henrik$items = substr(henrik$items, 2, nchar(henrik$items) - 1)
henrik$items = gsub(",", ", ", henrik$items)
l = list(
  henrik = henrik, allan = allan, jay = jay, uwe_dt = uwe_dt, uwe_dplyr = uwe_dplyr)
l = lapply(l, function(d){
  d = setNames(as.data.frame(d), c("items", "count"))
  d = d[order(d$items), ]
  row.names(d) = NULL
  d
})
all.equal(l[["henrik"]], l[["allan"]])
# TRUE
all.equal(l[["henrik"]], l[["jay"]])
# TRUE
all.equal(l[["henrik"]], l[["uwe_dt"]])
# TRUE
all.equal(l[["henrik"]], l[["uwe_dplyr"]])
# TRUE
Functions:
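# note: these assume the relevant packages are attached:
# arules (f_henrik), dplyr (f_allan, f_uwe_dplyr), data.table (f_uwe_dt)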
f_henrik = function(dat){
  r = apriori(data = as.matrix(table(dat) > 0),
              par = list(target = "frequent itemsets",
                         support = 0,
                         minlen = 2))
  d = as(r, "data.frame")
  d[d$count > 0, c("items", "count")]
}

f_allan = function(dat){
  all_multiples <- function(strings)
  {
    n <- length(strings)
    do.call("c", sapply(1:2^n, function(x) {
      mystrings <- strings[as.character(rawToBits(as.raw(x))[seq(n)]) == "01"]
      if (length(mystrings) > 1) paste(mystrings, collapse = ", ") else NULL
    }))
  }
  dat %>%
    group_by(Customer) %>%
    arrange(Product) %>%
    summarize(Product_group = all_multiples(Product)) %>%
    group_by(Product_group) %>%
    count(Product_group)
}

f_jay = function(dat){
  a <- split(dat$Product, dat$Customer)  ## thx to @Henrik
  r <- range(lengths(a))
  pr <- unlist(lapply(r[1]:r[2], function(x)
    combn(unique(dat$Product), x, list)), recursive = FALSE)
  or <- rowSums(outer(pr, a, Vectorize(function(x, y) all(x %in% y))))
  res <- data.frame(p.group = sapply(pr, toString), number = or)
  res[res$number > 0, ]
}

f_uwe_dt = function(dat){
  setorder(setDT(dat), Customer, Product)
  dat[, .(Product.Group = unlist(lapply(tail(seq(.N), -1L),
                                        function(m) combn(unique(Product), m, toString, FALSE)))),
      by = Customer][
      , .N, by = Product.Group]
}

f_uwe_dplyr = function(dat){
  dat %>%
    arrange(Customer, Product) %>%
    group_by(Customer) %>%
    summarise(Product.Group = n() %>%
                seq() %>%
                tail(-1L) %>%
                lapply(function(m) combn(unique(Product), m, toString, FALSE)) %>%
                unlist()) %>%
    ungroup() %>%
    count(Product.Group)
}
If you define a little helper function that gets all multiple groupings:
all_multiples <- function(strings)
{
  n <- length(strings)
  do.call("c", sapply(1:2^n, function(x) {
    mystrings <- strings[as.character(rawToBits(as.raw(x))[seq(n)]) == "01"]
    if (length(mystrings) > 1) paste(mystrings, collapse = ", ") else NULL
  }))
}
then you can do this nicely in a tidyverse pipe:
library(dplyr)

dat %>%
  group_by(Customer) %>%
  arrange(Product) %>%
  summarize(Product_group = all_multiples(Product)) %>%
  group_by(Product_group) %>%
  count(Product_group)
#> # A tibble: 11 x 2
#> # Groups:   Product_group [11]
#>    Product_group     n
#>    <chr>         <int>
#>  1 A, B              3
#>  2 A, B, C           1
#>  3 A, B, D           1
#>  4 A, C              1
#>  5 A, D              1
#>  6 B, C              1
#>  7 B, D              1
#>  8 D, E              1
#>  9 D, E, F           1
#> 10 D, F              1
#> 11 E, F              1
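To see what the helper contributes per customer, here is its output for customer 1's sorted basket; each integer from 1 to 2^n selects one subset of the products via its bits, and singletons are dropped:

all_multiples(c("A", "B", "C"))
# [1] "A, B"    "A, C"    "B, C"    "A, B, C"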
For the sake of completeness, here is a solution in data.table syntax which can be translated to dplyr syntax as well.
For both implementations, the core idea is the same:

- Order by Product (an important step which has been neglected by the other answers posted so far).
- For each Customer, create the product groups by using combn() with varying lengths m. Product.Group is a kind of natural key created by concatenating the included products using the toString() function (a small illustration of this step follows the list).
- The ordering by Product is important: products B, A as well as A, B should appear in the same product group A, B.
- Finally, count the occurrences of each Product.Group.
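For instance, for a customer with the (ordered) products A, B, C, the combn() step amounts to:

# m runs over 2..N; combn() enumerates the groups, toString() builds the key
unlist(lapply(2:3, function(m) combn(c("A", "B", "C"), m, toString, FALSE)))
# [1] "A, B"    "A, C"    "B, C"    "A, B, C"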
data.table version

library(data.table)
setorder(setDT(df), Customer, Product)

df[, .(Product.Group = unlist(lapply(tail(seq(.N), -1L),
                                     function(m) combn(unique(Product), m, toString, FALSE)))),
   by = Customer][
   , .N, by = Product.Group]
    Product.Group N
 1:          A, B 3
 2:          A, C 1
 3:          B, C 1
 4:       A, B, C 1
 5:          D, E 1
 6:          D, F 1
 7:          E, F 1
 8:       D, E, F 1
 9:          A, D 1
10:          B, D 1
11:       A, B, D 1
dplyr version

library(dplyr)
df %>%
  arrange(Customer, Product) %>%
  group_by(Customer) %>%
  summarise(Product.Group = n() %>%
              seq() %>%
              tail(-1L) %>%
              lapply(function(m) combn(unique(Product), m, toString, FALSE)) %>%
              unlist()) %>%
  ungroup() %>%
  count(Product.Group)
   Product.Group     n
   <chr>         <int>
 1 A, B              3
 2 A, B, C           1
 3 A, B, D           1
 4 A, C              1
 5 A, D              1
 6 B, C              1
 7 B, D              1
 8 D, E              1
 9 D, E, F           1
10 D, F              1
11 E, F              1
Data

library(data.table)
df <- fread("
Customer Product
1 A
1 B
1 C
2 D
2 E
2 F
3 A
3 B
3 D
4 A
4 B")
You could split the data along customers, then get all combinations of product pairs and triples using combn. Then find matches using %in% with outer, create a data frame by collapsing products using toString, and finally discard elements which are zero.
# a <- aggregate(Product ~ Customer, dat, I)$Product  ## old solution
# if (is.matrix(a)) a <- as.data.frame(t(a))          ## old solution
a <- split(dat$Product, dat$Customer)  ## thx to @Henrik
r <- range(lengths(a))
pr <- unlist(lapply(r[1]:r[2], function(x)
  combn(unique(dat$Product), x, list)), recursive = FALSE)
or <- rowSums(outer(pr, a, Vectorize(function(x, y) all(x %in% y))))
res <- data.frame(p.group = sapply(pr, toString), number = or)
res[res$number > 0, ]
#    p.group number
# 1     A, B      3
# 2     A, C      1
# 3     A, D      1
# 6     B, C      1
# 7     B, D      1
# 13    D, E      1
# 14    D, F      1
# 15    E, F      1
# 16 A, B, C      1
# 17 A, B, D      1
# 35 D, E, F      1
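The perhaps surprising ingredient is outer() over two lists: Vectorize() makes the containment test mapply-compatible, so every candidate group is checked against every customer's basket. A small standalone illustration with made-up objects:

pr2 <- list(c("A", "B"), c("A", "B", "C"))                    # candidate groups
a2  <- list(c("A", "B", "C"), c("D", "E", "F"), c("A", "B"))  # customer baskets
outer(pr2, a2, Vectorize(function(x, y) all(x %in% y)))
#      [,1]  [,2]  [,3]
# [1,] TRUE FALSE  TRUE
# [2,] TRUE FALSE FALSE
# rowSums() of this matrix gives the counts: "A, B" -> 2, "A, B, C" -> 1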
Data
dat <- read.table(header=TRUE, text="Customer Product
1 A
1 B
1 C
2 D
2 E
2 F
3 A
3 B
3 D
4 A
4 B")