Problem:
I have a list of millions of transactions. Each transaction contains items (e.g. 'carrots', 'apples'). The goal is to generate a list of pairs of items that frequently occur together in individual transactions. As far as I can tell, an exhaustive search isn't feasible.
Solution attempts:
So far I have two ideas: 1) randomly sample some appropriate fraction of the transactions and only check those (a rough sketch of this is below), or 2) count how often each item appears, use that to calculate how often items should appear together by chance, and use that to adjust the estimate from 1.
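For what it's worth, a minimal sketch of idea 1 in Python (the sample size, seed and names are placeholders, nothing I've settled on):

import random
from collections import Counter
from itertools import combinations

def sampled_pair_counts(transactions, sample_size=100_000, seed=0):
    # Idea 1: estimate pair frequencies from a random sample of transactions.
    rng = random.Random(seed)
    sample = rng.sample(transactions, min(sample_size, len(transactions)))
    counts = Counter()
    for items in sample:
        # Count each unordered pair of distinct items once per transaction.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return counts

# e.g. sampled_pair_counts(all_transactions).most_common(50)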
Any tips, alternative approaches, ready-made solutions or just general reading suggestions are much appreciated.
Edit:
Some additional information from the comments
Number of different items: 1,000 to 100,000
Memory constraint: A few GB of RAM at most, for a few hours.
Frequency of use: More or less a one-off.
Available resources: 20-100 hours of newbie programmer time.
Desired result format: Pairs of items with some measure of how often they appear together, for the n most frequent pairs.
Distribution of items per transaction: Unknown as of now.
Let the number of transactions be n, the number of items be k, and the average size of a transaction be d.

The naive approach (checking every pair against all records) gives an O(k^2 * n * d) solution, which is far from optimal. We can improve it to O(k * n * d), and if we assume a uniform distribution of items (i.e. each item appears on average O(n*d/k) times) we can improve it further to O(d^2 * n + k^2), which is much better, since most likely d << k.

This can be done by building an inverted index of your transactions, meaning: create a map from each item to the transactions containing it (building the index is O(nd + k)).
For example, if you have the transactions:
transaction1 = ('apple','grape')
transaction2 = ('apple','banana','mango')
transaction3 = ('grape','mango')
The inverted index will be:
'apple' -> [1,2]
'grape' -> [1,3]
'banana' -> [2]
'mango' -> [2,3]
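For concreteness, here is a minimal sketch of building such an index in Python (the function name and the list-of-tuples input are assumptions about how the data is held):

from collections import defaultdict

def build_inverted_index(transactions):
    # Map each item to the ids of the transactions that contain it: O(nd + k).
    index = defaultdict(list)
    for tid, items in enumerate(transactions):
        for item in set(items):  # de-duplicate items within one transaction
            index[item].append(tid)
    return index

transactions = [('apple', 'grape'),
                ('apple', 'banana', 'mango'),
                ('grape', 'mango')]
index = build_inverted_index(transactions)
# index['apple'] == [0, 1]  (the example above uses 1-based ids; enumerate is 0-based)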
So, after understanding what an inverted index is, here are the guidelines for the solution:
1. Build the inverted index of your transactions.
2. For each item x, iterate over the transactions containing x and build a map (histogram) counting every pair (x,y) such that y co-occurs with x (sketched in code below).
3. Iterate over the pair map and extract the top-X most frequent pairs.
Complexity analysis:
1. Building the inverted index is O(nd + k).
2. Under the uniformity assumption, each item appears in O(nd/k) transactions, each iteration takes O(nd/k * d) time, and you have k iterations in this step, so you get O(nd^2 + k) for this step.
3. Scanning the pair map is O(k^2).
Totaling in an O(nd^2 + k^2) solution to get the top-X elements, which is MUCH better than the naive approach, assuming d << k.
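A possible sketch of steps 2 and 3 on top of that index (Counter stands in for the pair histogram and most_common for the top-X extraction; it assumes the transactions list and index from the sketch above):

from collections import Counter

def top_cooccurring_pairs(transactions, index, top_x=10):
    # Step 2: for each item x, walk the transactions containing x and count pairs (x, y).
    pair_counts = Counter()
    for x, tids in index.items():
        for tid in tids:
            for y in set(transactions[tid]):
                if y > x:  # count each unordered pair exactly once
                    pair_counts[(x, y)] += 1
    # Step 3: extract the top-X most frequent pairs.
    return pair_counts.most_common(top_x)

# top_cooccurring_pairs(transactions, index, top_x=5)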
In addition, note that the bottleneck (step 2) can be efficiently parallelized and distributed among threads if needed.
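One way that parallelization could look with Python's multiprocessing (a sketch only; the per-process globals and the plain map over items are my assumptions, not part of the approach above):

from collections import Counter
from functools import reduce
from multiprocessing import Pool

_transactions = None
_index = None

def _init(transactions, index):
    # Runs once per worker process so the data is not re-sent for every task.
    global _transactions, _index
    _transactions, _index = transactions, index

def _count_for_item(x):
    # One unit of step 2: count the pairs (x, y) for a single item x.
    counts = Counter()
    for tid in _index[x]:
        for y in set(_transactions[tid]):
            if y > x:
                counts[(x, y)] += 1
    return counts

def parallel_pair_counts(transactions, index, processes=4):
    # Call this from under an `if __name__ == '__main__':` guard on platforms
    # that spawn worker processes (Windows, macOS).
    with Pool(processes, initializer=_init, initargs=(transactions, index)) as pool:
        partials = pool.map(_count_for_item, list(index))
    return reduce(lambda a, b: a + b, partials, Counter())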