I am new to this area as well as the terminology so please feel free to suggest if I go wrong somewhere. I have two datasets like this:
A B C 0 E A 0 C 0 0 A 0 C D E A 0 C 0 E
The way I interpret this is at some point in time, (A,B,C,E) occurred together and so did (A,C), (A,C,D,E) etc.
5A 1B 5C 0 2E 4A 0 5C 0 0 2A 0 1C 4D 4E 3A 0 4C 0 3E
The way I interpret this is at some point in time, 5 occurrences of A, 1 occurrence of B, 5 occurrences of C and 2 occurrences of E happened and so on.
I am trying to find what items occur together and if possible, also find out the cause and effect for this. For this, I am not understanding how to go about using both the datasets (or if one is enough). It would be good to have a good tutorial on this but my primary question is which dataset to utilize and how to proceed in (i) building a frequent itemset and (ii) building association rules between them.
Can someone point me to a practical tutorials/examples (preferably in Python) or at least explain in brief words on how to approach this problem?
Association rules are created by searching data for frequent if-then patterns and using the criteria support and confidence to identify the most important relationships. Support is an indication of how frequently the items appear in the data.
Frequent itemsets are the ones which occur at least a minimum number of times in the transactions. Technically, these are the itemsets for which support value (fraction of transactions containing the itemset) is above a minimum threshold — minsup.
Many algorithms for generating association rules have been proposed. Some well-known algorithms are Apriori, Eclat and FP-Growth, but they only do half the job, since they are algorithms for mining frequent itemsets. Another step needs to be done after to generate rules from frequent itemsets found in a database.
Some theoretical facts about association rules:
To find association rules, you can use apriori algorithm. There already exists many python implementation, although most of them are not efficient for practical usage:
or use Orange data mining library, which has a good library for association rules.
Usage example:
''' save first example as item.basket with format A, B, C, E A, C A, C, D, E A, C, E open ipython same directory as saved file or use os module >>> import os >>> os.chdir("c:/orange") ''' import orange items = orange.ExampleTable("item") #play with support argument to filter out rules rules = orange.AssociationRulesSparseInducer(items, support = 0.1) for r in rules: print "%5.3f %5.3f %s" % (r.support, r.confidence, r)
To learn more about association rules/frequent item mining, then my selection of books are:
There is no short way.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With