
Simple cross-tabulation in pandas

I stumbled across pandas and it looks ideal for simple calculations that I'd like to do. I have a SAS background and was thinking it'd replace proc freq -- it looks like it'll scale to what I may want to do in the future. However, I just can't seem to get my head around a simple task (I'm not sure if I'm supposed to look at pivot/crosstab/indexing - whether I should have a Panel or DataFrames etc...). Could someone give me some pointers on how to do the following:

I have two CSV files (one for year 2010, one for year 2011 - simple transactional data). The columns are category and amount.

2010:

AB,100.00
AB,200.00
AC,150.00
AD,500.00

2011:

AB,500.00
AC,250.00
AX,900.00

These are loaded into separate DataFrame objects.
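
Roughly like this (a minimal sketch - the file names are assumed, and column names are supplied because the sample files have no header row):

import pandas as pd

# The files have no header row, so column names are given explicitly;
# 'category' and 'amount' are the columns described above.
df_2010 = pd.read_csv("2010.csv", names=["category", "amount"])
df_2011 = pd.read_csv("2011.csv", names=["category", "amount"])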

What I'd like to do is get the category, the sum of the amounts for that category, and the frequency of the category, e.g.:

2010:

AB,300.00,2
AC,150.00,1
AD,500.00,1

2011:

AB,500.00,1
AC,250.00,1
AX,900.00,1

I can't work out whether I should be using pivot/crosstab/groupby/an index etc... I can get either the sum or the frequency - I can't seem to get both... It gets a bit more complex because I would like to do it on a month by month basis, but I think if someone would be so kind to point me to the right technique/direction I'll be able to go from there.

asked Mar 06 '12 by Jon Clements




2 Answers

v0.21 answer

Use pivot_table with the index parameter:

df.pivot_table(index='category', aggfunc=[len, sum])

           len   sum
         value value
category
AB           2   300
AC           1   150
AD           1   500
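
For reference, a self-contained sketch of the same call (the frame is built inline from the 2010 sample; 'count' and 'sum' are the string spellings of the len/sum aggregations and avoid deprecation warnings on newer pandas releases):

import pandas as pd

df = pd.DataFrame({
    "category": ["AB", "AB", "AC", "AD"],
    "value": [100.0, 200.0, 150.0, 500.0],
})

# Two aggregations per group: the row count and the column sum.
result = df.pivot_table(index="category", values="value",
                        aggfunc=["count", "sum"])
print(result)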

<= v0.12

It is possible to do this using pivot_table for those interested:

In [8]: df
Out[8]:
  category  value
0       AB    100
1       AB    200
2       AC    150
3       AD    500

In [9]: df.pivot_table(rows='category', aggfunc=[len, np.sum])
Out[9]:
            len    sum
          value  value
category
AB            2    300
AC            1    150
AD            1    500

Note that the result's columns are hierarchically indexed. If you had multiple data columns, you would get a result like this:

In [12]: df
Out[12]:
  category  value  value2
0       AB    100       5
1       AB    200       5
2       AC    150       5
3       AD    500       5

In [13]: df.pivot_table(rows='category', aggfunc=[len, np.sum])
Out[13]:
            len            sum
          value  value2  value  value2
category
AB            2       2    300      10
AC            1       1    150       5
AD            1       1    500       5
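
Pieces of that result can then be selected by aggregation name or by (aggregation, column) pairs - a small sketch, assuming the column labels match the printed headers above:

res = df.pivot_table(rows='category', aggfunc=[len, np.sum])  # the frame from Out[13]

res['sum']             # just the sum block (columns value and value2)
res[('sum', 'value')]  # a single summed column as a Series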

The main reason to use np.sum rather than the built-in sum is that np.sum gives you NA-handling. The Python built-in could probably be intercepted as well; I'll make a note about that now.

answered by Wes McKinney


Assuming that you have a file called 2010.csv with contents

category,value
AB,100.00
AB,200.00
AC,150.00
AD,500.00

Then, using the ability to apply multiple aggregation functions following a groupby, you can say:

import pandas
data_2010 = pandas.read_csv("/path/to/2010.csv")
data_2010.groupby("category").agg([len, sum])

You should get a result that looks something like

         value
           len  sum
category
AB           2  300
AC           1  150
AD           1  500

Note that Wes will likely come by to point out that np.sum is optimized and that you should probably use it instead of the built-in sum.
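
For the month-by-month version mentioned in the question, the same approach extends to grouping on more than one key. A sketch, assuming a hypothetical date column (the sample files above don't have one):

import pandas as pd

# Hypothetical layout: date,category,value
data = pd.read_csv("/path/to/2010.csv", parse_dates=["date"])
monthly = (data
           .groupby([data["date"].dt.month, "category"])["value"]
           .agg(["count", "sum"]))
print(monthly)

Grouping on data["date"].dt.to_period("M") instead keeps year and month together if a file ever spans more than one year.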

answered by Jeff Hammerbacher