My problem is how to calculate frequencies on multiple variables in pandas . I have from this dataframe :
d1 = pd.DataFrame( {'StudentID': ["x1", "x10", "x2","x3", "x4", "x5", "x6", "x7", "x8", "x9"],
'StudentGender' : ['F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'M', 'M'],
'ExamenYear': ['2007','2007','2007','2008','2008','2008','2008','2009','2009','2009'],
'Exam': ['algebra', 'stats', 'bio', 'algebra', 'algebra', 'stats', 'stats', 'algebra', 'bio', 'bio'],
'Participated': ['no','yes','yes','yes','no','yes','yes','yes','yes','yes'],
'Passed': ['no','yes','yes','yes','no','yes','yes','yes','no','yes']},
columns = ['StudentID', 'StudentGender', 'ExamenYear', 'Exam', 'Participated', 'Passed'])
To the following result
Participated OfWhichpassed
ExamenYear
2007 3 2
2008 4 3
2009 3 2
(1) One possibility I tried is to compute two dataframe and bind them
t1 = d1.pivot_table(values = 'StudentID', rows=['ExamenYear'], cols = ['Participated'], aggfunc = len)
t2 = d1.pivot_table(values = 'StudentID', rows=['ExamenYear'], cols = ['Passed'], aggfunc = len)
tx = pd.concat([t1, t2] , axis = 1)
Res1 = tx['yes']
(2) The second possibility is to use an aggregation function .
import collections
dg = d1.groupby('ExamenYear')
Res2 = dg.agg({'Participated': len,'Passed': lambda x : collections.Counter(x == 'yes')[True]})
Res2.columns = ['Participated', 'OfWhichpassed']
Both ways are awckward to say the least. How is this done properly in pandas ?
P.S: I also tried value_counts instead of collections.Counter but could not get it to work
For reference: Few months ago, I asked similar question for R here and plyr could help
---- UPDATE ------
user DSM is right. there was a mistake in the desired table result.
(1) The code for option one is
t1 = d1.pivot_table(values = 'StudentID', rows=['ExamenYear'], aggfunc = len)
t2 = d1.pivot_table(values = 'StudentID', rows=['ExamenYear'], cols = ['Participated'], aggfunc = len)
t3 = d1.pivot_table(values = 'StudentID', rows=['ExamenYear'], cols = ['Passed'], aggfunc = len)
Res1 = pd.DataFrame( {'All': t1,
'OfWhichParticipated': t2['yes'],
'OfWhichPassed': t3['yes']})
It will produce the result
All OfWhichParticipated OfWhichPassed
ExamenYear
2007 3 2 2
2008 4 3 3
2009 3 3 2
(2) For Option 2, thanks to user herrfz, I figured out how to use value_count and the code will be
Res2 = d1.groupby('ExamenYear').agg({'StudentID': len,
'Participated': lambda x: x.value_counts()['yes'],
'Passed': lambda x: x.value_counts()['yes']})
Res2.columns = ['All', 'OfWgichParticipated', 'OfWhichPassed']
which will produce the same result as Res1
My question remains though:
Using Option 2, will it be possible to use the same Variable twice (for another operation ?) can one pass a custom name for the resulting variable ?
---- A NEW UPDATE ----
I have finally decided to use apply which I understand is more flexible.
To create a frequency table in R, we can simply use table function but the output of table function returns a horizontal table. If we want to read the table in data frame format then we would need to read the table as a data frame using as. data. frame function.
To create a frequency column for categorical variable in an R data frame, we can use the transform function by defining the length of categorical variable using ave function. The output will have the duplicated frequencies as one value in the categorical column is likely to be repeated.
You can create tables with an unlimited number of variables by selecting Insert > Analysis > More and then selecting Tables > Multiway Table. For example, the table below shows Average monthly bill by Occupation, Work Status, and Gender.
I finally decided to use apply.
I am posting what I came up with hoping that it can be useful for others.
From what I understand from Wes' book "Python for Data analysis"
Here is what I came up with
def ZahlOccurence_0(x):
return pd.Series({'All': len(x['StudentID']),
'Part': sum(x['Participated'] == 'yes'),
'Pass' : sum(x['Passed'] == 'yes')})
when I run it :
d1.groupby('ExamenYear').apply(ZahlOccurence_0)
I get the correct results
All Part Pass
ExamenYear
2007 3 2 2
2008 4 3 3
2009 3 3 2
This approach would also allow me to combine frequencies with other stats
import numpy as np
d1['testValue'] = np.random.randn(len(d1))
def ZahlOccurence_1(x):
return pd.Series({'All': len(x['StudentID']),
'Part': sum(x['Participated'] == 'yes'),
'Pass' : sum(x['Passed'] == 'yes'),
'test' : x['testValue'].mean()})
d1.groupby('ExamenYear').apply(ZahlOccurence_1)
All Part Pass test
ExamenYear
2007 3 2 2 0.358702
2008 4 3 3 1.004504
2009 3 3 2 0.521511
I hope someone else will find this useful
You may use pandas crosstab function, which by default computes a frequency table of two or more variables. For example,
> import pandas as pd
> pd.crosstab(d1['ExamenYear'], d1['Passed'])
Passed no yes
ExamenYear
2007 1 2
2008 1 3
2009 1 2
Use the margins=True
option if you also want to see the subtotal of each row and column.
> pd.crosstab(d1['ExamenYear'], d1['Participated'], margins=True)
Participated no yes All
ExamenYear
2007 1 2 3
2008 1 3 4
2009 0 3 3
All 2 8 10
This:
d1.groupby('ExamenYear').agg({'Participated': len,
'Passed': lambda x: sum(x == 'yes')})
doesn't look way more awkward than the R solution, IMHO.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With