I stumbled across pandas and it looks ideal for the simple calculations I'd like to do. I have a SAS background and was thinking it would replace PROC FREQ; it also looks like it will scale to what I may want to do in the future. However, I just can't seem to get my head around a simple task (I'm not sure whether I should be looking at pivot/crosstab/indexing, whether I should have a Panel or DataFrames, etc.). Could someone give me some pointers on how to do the following?
I have two CSV files (one for year 2010, one for year 2011) containing simple transactional data. The columns are category and amount.

2010:

AB,100.00
AB,200.00
AC,150.00
AD,500.00

2011:

AB,500.00
AC,250.00
AX,900.00
These are loaded into separate DataFrame objects.
What I'd like to do is get the category, the sum for the category, and the frequency of the category, e.g.:

2010:

AB,300.00,2
AC,150.00,1
AD,500.00,1

2011:

AB,500.00,1
AC,250.00,1
AX,900.00,1
I can't work out whether I should be using pivot/crosstab/groupby/an index, etc. I can get either the sum or the frequency, but I can't seem to get both. It gets a bit more complex because I would also like to do it on a month-by-month basis, but I think if someone would be so kind as to point me in the right direction, I'll be able to go from there.
A crosstab is a table showing the relationship between two or more variables. Where the table only shows the relationship between two categorical variables, a crosstab is also known as a contingency table.
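For illustration, here is what a contingency table looks like with `pd.crosstab`, using made-up data with two categorical columns:

```python
import pandas as pd

# Hypothetical data: two categorical columns (values made up for illustration)
df = pd.DataFrame({
    "category": ["AB", "AB", "AC", "AD"],
    "year": [2010, 2011, 2010, 2010],
})

# pd.crosstab builds the contingency table: counts of category by year
table = pd.crosstab(df["category"], df["year"])
print(table)
```

Each cell holds the number of rows where that pair of values occurs together.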
v0.21 answer

Use pivot_table with the index parameter:

df.pivot_table(index='category', aggfunc=[len, sum])

          len   sum
        value value
category
AB          2   300
AC          1   150
AD          1   500
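On recent pandas versions the same result can also be written with string aggregation names instead of the built-in len/sum; a sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["AB", "AB", "AC", "AD"],
    "value": [100.0, 200.0, 150.0, 500.0],
})

# "count" gives the frequency, "sum" the total, per category
result = df.pivot_table(index="category", values="value",
                        aggfunc=["count", "sum"])
print(result)
```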
<= v0.12

It is possible to do this using pivot_table, for those interested:

In [8]: df
Out[8]:
  category  value
0       AB    100
1       AB    200
2       AC    150
3       AD    500

In [9]: df.pivot_table(rows='category', aggfunc=[len, np.sum])
Out[9]:
          len   sum
        value value
category
AB          2   300
AC          1   150
AD          1   500
Note that the result's columns are hierarchically indexed. If you had multiple data columns, you would get a result like this:
In [12]: df
Out[12]:
  category  value  value2
0       AB    100       5
1       AB    200       5
2       AC    150       5
3       AD    500       5

In [13]: df.pivot_table(rows='category', aggfunc=[len, np.sum])
Out[13]:
          len            sum
        value value2   value value2
category
AB          2      2     300     10
AC          1      1     150      5
AD          1      1     500      5
The main reason to use np.sum vs. the built-in __builtin__.sum is that you get NA-handling from the former. Probably could intercept the Python built-in; will make a note about that now.
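The difference shows up as soon as the data contains missing values; a small illustration:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 2.0])

# The Python built-in iterates over the values, so NaN propagates
print(sum(s))     # nan
# np.sum on a Series dispatches to pandas' sum, which skips NaN by default
print(np.sum(s))  # 3.0
```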
Assuming that you have a file called 2010.csv with contents
category,value
AB,100.00
AB,200.00
AC,150.00
AD,500.00
Then, using the ability to apply multiple aggregation functions following a groupby, you can say:
import pandas
data_2010 = pandas.read_csv("/path/to/2010.csv")
data_2010.groupby("category").agg([len, sum])
You should get a result that looks something like
         value
           len  sum
category
AB           2  300
AC           1  150
AD           1  500
Note that Wes will likely come by to point out that sum is optimized and that you should probably use np.sum.
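As for the month-by-month part of the question: assuming the real files also carry a date column (hypothetical here), you can group on the month period together with the category:

```python
import pandas as pd

# Hypothetical data: the original transactions with an added date column
df = pd.DataFrame({
    "date": pd.to_datetime(["2010-01-05", "2010-01-20",
                            "2010-02-11", "2010-02-28"]),
    "category": ["AB", "AB", "AC", "AD"],
    "value": [100.0, 200.0, 150.0, 500.0],
})

# Group on (month, category) to get the sum and frequency per month
monthly = (df.groupby([df["date"].dt.to_period("M"), "category"])["value"]
             .agg(["sum", "count"]))
print(monthly)
```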