
What's a good data model for cross-tabulation?

I'm implementing a cross-tabulation library in Python as a programming exercise for my new job. My current implementation meets the requirements, but it is inelegant and redundant. I'd like a better model for it: one that lets data move cleanly between the base model, stored as tabular data in flat files, and whatever statistical analysis results might be requested of it.

Right now, I have a progression from a set of tuples for each row in the table, to a histogram counting the frequencies of the appearances of the tuples of interest, to a serializer that -- somewhat clumsily -- compiles the output into a set of table cells for display. However, I end up having to go back up to the table or to the histogram more often than I want to because there's never enough information in place.

So, any ideas?

Edit: Here's an example of some data and what I want to be able to build from it. Note that "." denotes 'missing' data, which is only conditionally counted.

1   .   1
1   0   3
1   0   3
1   2   3
2   .   1
2   0   .
2   2   2
2   2   4
2   2   .

If I were looking at the correlation between columns 0 and 2 above, this is the table I'd have:

    . 1 2 3 4
1   0 1 0 3 0
2   2 1 1 0 1

In addition, I'd want to be able to calculate ratios such as frequency/total, frequency/subtotal, etc.

Chris R asked Jun 19 '09 19:06



2 Answers

You could use an in-memory sqlite database as a data structure, and define the desired operations as SQL queries.

import sqlite3

c = sqlite3.connect(':memory:')  # in-memory database; nothing is written to disk
c.execute('CREATE TABLE data (a, b, c)')

c.executemany('INSERT INTO data VALUES (?, ?, ?)', [
    (1, None,    1),
    (1,    0,    3),
    (1,    0,    3),
    (1,    2,    3),
    (2, None,    1),
    (2,    0, None),
    (2,    2,    2),
    (2,    2,    4),
    (2,    2, None),
])

# queries
# ...
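
The answer leaves the queries themselves open. As one possible illustration (not the answerer's own queries), the sketch below counts each (a, c) pair with GROUP BY and derives a frequency/subtotal ratio. The names conn, counts, row_totals, grand_total and share_of_row are made up for this example, and it assumes "." is stored as NULL, as in the inserts above.

```python
import sqlite3

# Rebuild the same table so this sketch runs on its own.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (a, b, c)")
conn.executemany("INSERT INTO data VALUES (?, ?, ?)", [
    (1, None, 1), (1, 0, 3), (1, 0, 3), (1, 2, 3),
    (2, None, 1), (2, 0, None), (2, 2, 2), (2, 2, 4), (2, 2, None),
])

# Frequency of every (a, c) pair. GROUP BY puts all NULLs in one group,
# so the "missing" column of the cross-tab comes back under the key None.
counts = {
    (a, c): n
    for a, c, n in conn.execute("SELECT a, c, COUNT(*) FROM data GROUP BY a, c")
}

# Subtotals per value of a, and the grand total, for ratio calculations.
row_totals = dict(conn.execute("SELECT a, COUNT(*) FROM data GROUP BY a"))
grand_total = conn.execute("SELECT COUNT(*) FROM data").fetchone()[0]

def share_of_row(a, c):
    """Frequency of (a, c) divided by the subtotal for row a (hypothetical helper)."""
    return counts.get((a, c), 0) / row_totals[a]
```

Pairs that never occur are simply absent from counts, so the display layer has to fill in zeros, for instance with counts.get((a, c), 0). Whether missing values should be counted at all could be handled by adding a WHERE c IS NOT NULL clause.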
Roberto Bonvallet answered Oct 17 '22 02:10



S W has posted a good basic recipe for this on activestate.com.

The essence seems to be...

  1. Define xsort=[] and ysort=[] as arrays of your axes. Populate them by iterating through your data, or some other way.
  2. Define rs={} as a dict of dicts of your tabulated data, by iterating through your data and incrementing rs[yvalue][xvalue]. Create missing keys if/when needed.

Then for example the total for row y would be sum([rs[y][x] for x in xsort])
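
The two steps above can be sketched roughly as follows. This is not the ActiveState recipe itself, only a minimal reading of the summary: crosstab is a hypothetical function name, missing values ("." in the question) are assumed to be represented as None, and a defaultdict stands in for "create missing keys if/when needed".

```python
from collections import defaultdict

# The question's sample data, with "." written as None (an assumption).
rows = [
    (1, None, 1), (1, 0, 3), (1, 0, 3), (1, 2, 3),
    (2, None, 1), (2, 0, None), (2, 2, 2), (2, 2, 4), (2, 2, None),
]

def crosstab(rows, y_col, x_col):
    """Return (rs, xsort, ysort) for the two chosen column indices."""
    rs = defaultdict(lambda: defaultdict(int))  # rs[yvalue][xvalue] -> count
    xs, ys = set(), set()
    for row in rows:
        y, x = row[y_col], row[x_col]
        rs[y][x] += 1  # missing keys are created on first use
        xs.add(x)
        ys.add(y)

    # Sort axis values with None first, then the rest in ascending order.
    def axis_key(v):
        return (v is not None, 0 if v is None else v)

    return rs, sorted(xs, key=axis_key), sorted(ys, key=axis_key)

rs, xsort, ysort = crosstab(rows, 0, 2)

# Rebuild the table from the question, plus row totals and a grand total.
table = [[rs[y][x] for x in xsort] for y in ysort]
row_totals = {y: sum(rs[y][x] for x in xsort) for y in ysort}
grand_total = sum(row_totals.values())
```

With the totals kept next to rs, ratios such as rs[y][x] / row_totals[y] or rs[y][x] / grand_total can be computed without going back to the raw rows.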

krubo answered Oct 17 '22 01:10
