I've heard about the Apriori algorithm several times before but never got the time or the opportunity to dig into it, can anyone explain to me in a simple way the workings of this algorithm? Also, a basic example would make it a lot easier for me to understand.
Apriori algorithm refers to an algorithm that is used in mining frequent products sets and relevant association rules. Generally, the apriori algorithm operates on a database containing a huge number of transactions. For example, the items customers but at a Big Bazar.
Apriori algorithm is given by R. Agrawal and R. Srikant in 1994 for finding frequent itemsets in a dataset for boolean association rule. Name of the algorithm is Apriori because it uses prior knowledge of frequent itemset properties.
Apriori is an algorithm for frequent item set mining and association rule learning over relational databases. It proceeds by identifying the frequent individual items in the database and extending them to larger and larger item sets as long as those item sets appear sufficiently often in the database.
Steps: Join Step: This step generates (K+1) itemset from K-item sets by joining each item with itself. Prune Step: This step scans the count of each item in the database. If the candidate item does not meet minimum support, then it is regarded as infrequent, and thus it is removed.
It is a candidate-generation-and-test approach for frequent pattern mining in datasets. There are two things you have to remember.
Apriori Pruning Principle - If any itemset is infrequent, then its superset should not be generated/tested.
Apriori Property - A given (k+1)-itemset
is a candidate (k+1)-itemset
only if everyone of its k-itemset
subsets are frequent.
Now, here is the apriori algorithm in 4 steps.
1-itemset
.k+1
candidate itemsets from length k
frequent itemsets.Suppose there is a transaction database as follows with 4 transactions including their transaction IDs and items bought with them. Assume the minimum support - min_sup
is 2
. The term support is the number of transactions in which a certain itemset is present/included.
Transaction DB
tid | items
-------------
10 | A,C,D
20 | B,C,E
30 | A,B,C,E
40 | B,E
Now, let's create the candidate 1-itemsets
by the 1st scan of the DB. It is simply called as the set of C_1
as follows.
itemset | sup
-------------
{A} | 2
{B} | 3
{C} | 3
{D} | 1
{E} | 3
If we test this with min_sup
, we can see {D}
does not satisfy the min_sup
of 2
. So, it will not be included in the frequent 1-itemset
, which we simply call as the set of L_1
as follows.
itemset | sup
-------------
{A} | 2
{B} | 3
{C} | 3
{E} | 3
Now, let's scan the DB for the 2nd time, and generate candidate 2-itemsets
, which we simply call as the set of C_2
as follows.
itemset | sup
-------------
{A,B} | 1
{A,C} | 2
{A,E} | 1
{B,C} | 2
{B,E} | 3
{C,E} | 2
As you can see, {A,B}
and {A,E}
itemsets do not satisfy the min_sup
of 2
and hence they will not be included in the frequent 2-itemset
, L_2
itemset | sup
-------------
{A,C} | 2
{B,C} | 2
{B,E} | 3
{C,E} | 2
Now let's do a 3rd scan of the DB and get candidate 3-itemsets
, C_3
as follows.
itemset | sup
-------------
{A,B,C} | 1
{A,B,E} | 1
{A,C,E} | 1
{B,C,E} | 2
You can see that, {A,B,C}
, {A,B,E}
and {A,C,E}
does not satisfy min_sup
of 2
. So they will not be included in frequent 3-itemset
, L_3
as follows.
itemset | sup
-------------
{B,C,E} | 2
Now, finally, we can calculate the support (supp)
, confidence (conf)
and lift (interestingness value)
values of the Association/Correlation Rules that can be generated by the itemset {B,C,E}
as follows.
Rule | supp | conf | lift
-------------------------------------------
B -> C & E | 50% | 66.67% | 1.33
E -> B & C | 50% | 66.67% | 1.33
C -> E & B | 50% | 66.67% | 1.77
B & C -> E | 50% | 100% | 1.33
E & B -> C | 50% | 66.67% | 1.77
C & E -> B | 50% | 100% | 1.33
See Top 10 algorithms in data mining (free access) or The Top Ten Algorithms in Data Mining. The latter gives a detailed description of the algorithm, together with details on how to get optimized implementations.
Well, I would assume you've read the wikipedia entry but you said "a basic example would make it a lot easier for me to understand". Wikipedia has just that so I'll assume you haven't read it and suggest that you do.
Read the wikipedia article.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With