Efficient way to generate random contingency tables?

What is an efficient way to generate a random contingency table? By contingency table I mean a rectangular matrix whose row sums and column sums are all fixed in advance; the individual entries may be anything, as long as every row and every column adds up to its required total.

Note that it's very easy to generate random contingency tables, but I'm looking for something more efficient than the naive algorithm.

dsimcha asked Jun 04 '09 02:06


2 Answers

The source of the networksis package for R may be a useful reference. As far as I know, doing this efficiently requires fairly sophisticated Monte Carlo techniques (Markov chain methods or sequential importance sampling), so you may want to reuse an existing implementation rather than write your own.

Edit: The relevant paper is Chen, Diaconis, Holmes, and Liu (2005), "Sequential Monte Carlo Methods for Statistical Analysis of Tables". In the words of the authors, "[o]ur method compares favorably with other existing Monte Carlo-based algorithms, and sometimes is a few orders of magnitude more efficient."
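
For comparison, the Markov chain side of this can be sketched in a few lines. This is not the sequential importance sampling method of Chen et al., nor the networksis code; it is the simpler, well-known random walk that adds +1/-1 to the corners of a randomly chosen 2x2 sub-table, which leaves every row and column sum unchanged. The sketch assumes non-negative integer entries and at least two rows and two columns. The names `initial_table`, `mcmc_step`, `random_table_mcmc`, and the default step count are made up for illustration, and nothing here says how many steps are needed for the chain to mix well.

```python
import random


def initial_table(row_sums, col_sums):
    """Build one valid table greedily (northwest-corner fill) to start from."""
    if sum(row_sums) != sum(col_sums):
        raise ValueError("row and column totals must be equal")
    rows_left, cols_left = list(row_sums), list(col_sums)
    table = [[0] * len(col_sums) for _ in row_sums]
    i = j = 0
    while i < len(rows_left) and j < len(cols_left):
        v = min(rows_left[i], cols_left[j])
        table[i][j] = v
        rows_left[i] -= v
        cols_left[j] -= v
        if rows_left[i] == 0:
            i += 1  # row i is complete, move down
        else:
            j += 1  # column j is exhausted, move right
    return table


def mcmc_step(table, rng):
    """Try one +1/-1 move on a random 2x2 sub-table.

    The move is skipped if it would make a cell negative.
    """
    i1, i2 = rng.sample(range(len(table)), 2)
    j1, j2 = rng.sample(range(len(table[0])), 2)
    if table[i1][j2] > 0 and table[i2][j1] > 0:
        table[i1][j1] += 1
        table[i2][j2] += 1
        table[i1][j2] -= 1
        table[i2][j1] -= 1


def random_table_mcmc(row_sums, col_sums, steps=10_000, rng=None):
    """Run the random walk for a fixed (arbitrarily chosen) number of steps."""
    rng = rng or random.Random()
    table = initial_table(row_sums, col_sums)
    for _ in range(steps):
        mcmc_step(table, rng)
    return table
```

Calling `random_table_mcmc([3, 5, 2], [4, 4, 2])` returns a 3x3 table with those row and column sums. Because the proposal is symmetric, the walk targets the uniform distribution over valid tables, but slow mixing on large tables is exactly the kind of cost the paper's comparison refers to.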

othercriteria answered Sep 24 '22 03:09


This sounds like a constraint satisfaction problem (CSP) to me.

You start with any cell and choose its value at random from the set of values it is still allowed to take. You then update the sets of eligible values for all other cells in the same row and column, pick the next cell (according to whatever CSP heuristic you are using), and assign it a random value from its eligible set, again updating its row and column afterwards. If you reach a cell whose set of eligible values is empty, you have to backtrack.

However, the notion of 'set of eligible values' might be hard to represent in a data structure, depending on the range of values you are allowing.
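
If the entries are restricted to non-negative integers (an assumption, since the question does not say), the set of eligible values for a cell is just an interval and needs no special data structure. The sketch below fills cells in row-major order and keeps that interval tight enough that it never becomes empty, so no backtracking is needed in this particular setting. The name `random_table` is made up for illustration, and the resulting distribution over tables is not uniform in general.

```python
import random


def random_table(row_sums, col_sums, rng=None):
    """Fill a table cell by cell, drawing each value from its eligible interval."""
    if sum(row_sums) != sum(col_sums):
        raise ValueError("row and column totals must be equal")
    if any(s < 0 for s in list(row_sums) + list(col_sums)):
        raise ValueError("totals must be non-negative")
    rng = rng or random.Random()

    col_left = list(col_sums)  # what each column still has to receive
    table = []
    for row_total in row_sums:
        row = []
        row_left = row_total          # what this row still has to receive
        later = sum(col_left)         # capacity of columns j, j+1, ... in this row
        for j in range(len(col_left)):
            later -= col_left[j]      # now the capacity of columns after j
            # Lower bound: the columns after j must be able to absorb the rest.
            low = max(0, row_left - later)
            # Upper bound: do not exceed the row's or the column's remainder.
            high = min(row_left, col_left[j])
            value = rng.randint(low, high)  # inclusive on both ends
            row.append(value)
            row_left -= value
            col_left[j] -= value
        table.append(row)
    return table
```

For example, `random_table([3, 5, 2], [4, 4, 2])` returns a 3x3 list of lists whose rows sum to 3, 5, 2 and whose columns sum to 4, 4, 2. The last row is forced automatically, because by then the lower and upper bounds coincide for every cell.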

Roland Ewald answered Sep 23 '22 03:09