I'm implementing a cross-tabulation library in Python as a programming exercise for my new job, and I've got an implementation of the requirements that works but is inelegant and redundant. I'd like a better model for it, something that allows a nice, clean movement of data between the base model, stored as tabular data in flat files, and all of the statistical analysis results that might be asked of this.
Right now, I have a progression from a set of tuples for each row in the table, to a histogram counting the frequencies of the appearances of the tuples of interest, to a serializer that -- somewhat clumsily -- compiles the output into a set of table cells for display. However, I end up having to go back up to the table or to the histogram more often than I want to because there's never enough information in place.
So, any ideas?
Edit: Here's an example of some data, and what I want to be able to build from it. Note that "." denotes a bit of 'missing' data, that is only conditionally counted.
1 . 1
1 0 3
1 0 3
1 2 3
2 . 1
2 0 .
2 2 2
2 2 4
2 2 .
If I were looking at the correlation between columns 0 and 2 above, this is the table I'd have:
. 1 2 3 4
1 0 1 0 3 0
2 2 1 1 0 1
In addition, I'd want to be able to calculate ratio of frequency/total, frequency/subtotal, &c.
You could use an in-memory sqlite
database as a data structure, and define the desired operations as SQL queries.
import sqlite3
c = sqlite3.Connection(':memory:')
c.execute('CREATE TABLE data (a, b, c)')
c.executemany('INSERT INTO data VALUES (?, ?, ?)', [
(1, None, 1),
(1, 0, 3),
(1, 0, 3),
(1, 2, 3),
(2, None, 1),
(2, 0, None),
(2, 2, 2),
(2, 2, 4),
(2, 2, None),
])
# queries
# ...
S W has posted a good basic recipe for this on activestate.com.
The essence seems to be...
Then for example the total for row y would be sum([rs[y][x] for x in xsort])
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With