I'm writing a piece of code to evaluate my clustering algorithm, and I find that every evaluation method needs, as its basic data, an m×n matrix A = {a_ij}, where a_ij is the number of data points that are members of class c_i and elements of cluster k_j.
But there appear to be two matrices of this type in Introduction to Data Mining (Pang-Ning Tan et al.): one is the confusion matrix, the other is the contingency table. I do not fully understand the difference between the two. Which one best describes the matrix I want to use?
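For concreteness, this is roughly how I build that matrix at the moment (a minimal NumPy sketch; the label arrays below are just made-up placeholders):

    import numpy as np

    # True class labels c_i and predicted cluster labels k_j for each data point
    # (toy placeholder data; in practice these come from the dataset and the clustering output)
    classes  = np.array([0, 0, 0, 1, 1, 2, 2, 2])
    clusters = np.array([1, 1, 0, 0, 0, 2, 2, 1])

    m = classes.max() + 1   # number of classes
    n = clusters.max() + 1  # number of clusters

    # A[i, j] = number of data points that belong to class i and were placed in cluster j
    A = np.zeros((m, n), dtype=int)
    for c, k in zip(classes, clusters):
        A[c, k] += 1

    print(A)
    # (If I'm not mistaken, scikit-learn has the same thing built in:
    #  from sklearn.metrics.cluster import contingency_matrix)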
In statistics, a contingency table (also known as a cross tabulation or crosstab) is a type of table in a matrix format that displays the (multivariate) frequency distribution of the variables. They are heavily used in survey research, business intelligence, engineering, and scientific research.
Confusion matrix vs. correlation matrix: a confusion matrix is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one. A correlation matrix is a table showing correlation coefficients between variables.
A contingency table is essentially a display format used to analyse and record the relationship between two or more categorical variables. It is the categorical equivalent of the scatterplot used to analyse the relationship between two continuous variables.
A confusion matrix is a summary of prediction results on a classification problem. The numbers of correct and incorrect predictions are summarized with count values and broken down by each class. This is the key to the confusion matrix.
Wikipedia's definition:
In the field of artificial intelligence, a confusion matrix is a visualization tool typically used in supervised learning (in unsupervised learning it is typically called a matching matrix). Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class.
The confusion matrix should be clear: it basically tells how many of the actual results match the predicted results. For example, see this confusion matrix:
                     Predicted class
                       c1      c2
    Actual class c1    15       3
                 c2     0       2
It tells that:
Column 1, row 1 means that the classifier has predicted 15 items as belonging to class c1, and those 15 items actually belong to class c1 (which is a correct prediction).
Column 2, row 1 tells that the classifier has predicted that 3 items belong to class c2, but they actually belong to class c1 (which is a wrong prediction).
Column 1, row 2 means that none of the items that actually belong to class c2 have been predicted to belong to class c1 (which would have been a wrong prediction).
Column 2, row 2 tells that 2 items that belong to class c2 have been predicted to belong to class c2 (which is a correct prediction).
Now see the formulas for accuracy and error rate in your book (Chapter 4, Section 4.2), and you should be able to clearly understand what a confusion matrix is. It is used to test the accuracy of a classifier on data with known results. The k-fold method (also mentioned in the book) is one of the methods for estimating the accuracy of a classifier.
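To make that concrete, here is a small Python sketch (my own illustration, not from the book) that rebuilds the confusion matrix above and applies the accuracy and error-rate formulas:

    import numpy as np

    # Rows = actual class, columns = predicted class (same layout as the table above)
    #                     predicted c1  predicted c2
    confusion = np.array([[15, 3],    # actual c1
                          [ 0, 2]])   # actual c2

    correct    = np.trace(confusion)   # diagonal cells: predictions matching the actual class
    total      = confusion.sum()       # all predictions
    accuracy   = correct / total       # (15 + 2) / 20 = 0.85
    error_rate = 1 - accuracy          # (3 + 0) / 20 = 0.15

    print(f"accuracy = {accuracy:.2f}, error rate = {error_rate:.2f}")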
Now, for the contingency table, Wikipedia's definition:
In statistics, a contingency table (also referred to as cross tabulation or cross tab) is a type of table in a matrix format that displays the (multivariate) frequency distribution of the variables. It is often used to record and analyze the relation between two or more categorical variables.
In data mining, contingency tables are used to show which items appear together in a record, such as a transaction or a shopping cart in a sales analysis. For example (this is the example from the book you mentioned):
             Coffee   !Coffee
    tea         150        50      200
    !tea        650       150      800
                800       200     1000
It tells us that out of 1000 survey responses (about whether people like coffee, tea, both, or neither): 150 respondents like both tea and coffee, 50 like tea but not coffee, 650 like coffee but not tea, and 150 like neither.
Contingency tables are used to find the Support and Confidence of association rules, basically to evaluate association rules (read Chapter 6, 6.7.1).
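As a rough illustration (again my own sketch, with the counts hard-coded from the table above), the support and confidence of the rule {tea} -> {coffee} can be read straight off the contingency table:

    # Counts taken from the tea/coffee contingency table above
    tea_and_coffee = 150   # responses that include both tea and coffee
    tea_total      = 200   # all responses that include tea
    n_responses    = 1000  # total number of responses

    # Rule: {tea} -> {coffee}
    support    = tea_and_coffee / n_responses  # 150 / 1000 = 0.15
    confidence = tea_and_coffee / tea_total    # 150 / 200  = 0.75

    print(f"support = {support:.2f}, confidence = {confidence:.2f}")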
The difference, then, is that a confusion matrix is used to evaluate the performance of a classifier (it tells how accurate the classifier's predictions are), while a contingency table is used to evaluate association rules.
Now, after reading this answer, google a bit (always use Google while you are reading your book), read what is in the book, look at a few examples, and don't forget to solve a few of the exercises given in the book. You should then have a clear concept of both of them, and of what to use in a given situation and why.
Hope this helps.