Frequency tables in pandas (like plyr in R)

My problem is how to calculate frequencies on multiple variables in pandas. I want to go from this dataframe:

import pandas as pd

d1 = pd.DataFrame({'StudentID': ['x1', 'x10', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9'],
                   'StudentGender': ['F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'M', 'M'],
                   'ExamenYear': ['2007', '2007', '2007', '2008', '2008', '2008', '2008', '2009', '2009', '2009'],
                   'Exam': ['algebra', 'stats', 'bio', 'algebra', 'algebra', 'stats', 'stats', 'algebra', 'bio', 'bio'],
                   'Participated': ['no', 'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'yes'],
                   'Passed': ['no', 'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'yes', 'no', 'yes']},
                  columns=['StudentID', 'StudentGender', 'ExamenYear', 'Exam', 'Participated', 'Passed'])

to the following result:

             Participated  OfWhichpassed
 ExamenYear                             
2007                   3              2
2008                   4              3
2009                   3              2

(1) One possibility I tried is to compute two dataframes and bind them:

t1 = d1.pivot_table(values='StudentID', index=['ExamenYear'], columns=['Participated'], aggfunc=len)
t2 = d1.pivot_table(values='StudentID', index=['ExamenYear'], columns=['Passed'], aggfunc=len)
tx = pd.concat([t1, t2], axis=1)

Res1 = tx['yes']
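
If I run this on the sample data, tx ends up with two pairs of no/yes columns, so Res1 holds two columns that are both named yes (the Participated counts and the Passed counts), which is part of what makes this feel clumsy:

            yes  yes
ExamenYear          
2007          2    2
2008          3    3
2009          3    2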

(2) The second possibility is to use an aggregation function:

import collections
dg = d1.groupby('ExamenYear')
Res2 = dg.agg({'Participated': len,  # count all rows per year
               'Passed': lambda x: collections.Counter(x == 'yes')[True]})  # count 'yes' values

Res2.columns = ['Participated', 'OfWhichpassed']

Both ways are awkward, to say the least. How is this done properly in pandas?

P.S.: I also tried value_counts instead of collections.Counter but could not get it to work.

For reference: a few months ago, I asked a similar question for R here, and plyr could help.

---- UPDATE ------

User DSM is right: there was a mistake in the desired table result.

(1) The code for option one is

t1 = d1.pivot_table(values='StudentID', index=['ExamenYear'], aggfunc=len)
t2 = d1.pivot_table(values='StudentID', index=['ExamenYear'], columns=['Participated'], aggfunc=len)
t3 = d1.pivot_table(values='StudentID', index=['ExamenYear'], columns=['Passed'], aggfunc=len)

# t1 is a one-column DataFrame, hence the ['StudentID'] selection
Res1 = pd.DataFrame({'All': t1['StudentID'],
                     'OfWhichParticipated': t2['yes'],
                     'OfWhichPassed': t3['yes']})

It produces the following result:

             All  OfWhichParticipated  OfWhichPassed
ExamenYear                                         
2007          3                    2              2
2008          4                    3              3
2009          3                    3              2

(2) For option 2, thanks to user herrfz, I figured out how to use value_counts. The code is:

Res2 = d1.groupby('ExamenYear').agg({'StudentID': len,
                                     'Participated': lambda x: x.value_counts()['yes'],
                                     'Passed': lambda x: x.value_counts()['yes']})

Res2.columns = ['All', 'OfWhichParticipated', 'OfWhichPassed']

which produces the same result as Res1.

My question remains, though:

Using Option 2, is it possible to use the same variable twice (for another operation)? And can one pass a custom name for the resulting column?
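
Note for readers on newer pandas: named aggregation, added in pandas 0.25, addresses both points; the keyword argument becomes the output column name, and the same input column can feed several output columns. A minimal sketch on the data above:

Res3 = d1.groupby('ExamenYear').agg(
    All=('StudentID', 'size'),                                      # keyword = output column name
    OfWhichParticipated=('Participated', lambda x: (x == 'yes').sum()),
    OfWhichPassed=('Passed', lambda x: (x == 'yes').sum()),
)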

---- A NEW UPDATE ----

I have finally decided to use apply, which I understand is more flexible.

asked Mar 23 '13 by user1043144



3 Answers

I finally decided to use apply.

I am posting what I came up with hoping that it can be useful for others.

From what I understand from Wes McKinney's book "Python for Data Analysis":

  • apply is more flexible than agg and transform because you can define your own function.
  • The only requirement is that the function returns a pandas object or a scalar value.
  • The inner mechanics: the function is called on each piece of the grouped object and the results are glued together using pandas.concat (see the sketch after the example below).
  • One needs to "hard-code" the structure you want at the end.

Here is what I came up with:

def ZahlOccurence_0(x):
    return pd.Series({'All': len(x['StudentID']),
                      'Part': sum(x['Participated'] == 'yes'),
                      'Pass': sum(x['Passed'] == 'yes')})

When I run it:

d1.groupby('ExamenYear').apply(ZahlOccurence_0)

I get the correct results

            All  Part  Pass
ExamenYear                 
2007          3     2     2
2008          4     3     3
2009          3     3     2
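
As a rough illustration of the concat point in the list above (a sketch, not pandas' literal internals), the same table can be assembled by hand: call the function on each group, then glue the resulting Series together with pandas.concat:

pieces = {name: ZahlOccurence_0(group)
          for name, group in d1.groupby('ExamenYear')}
manual = pd.concat(pieces, axis=1).T   # one row per ExamenYear
manual.index.name = 'ExamenYear'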

This approach also allows me to combine frequencies with other stats:

import numpy as np
d1['testValue'] = np.random.randn(len(d1))

def ZahlOccurence_1(x):
    return pd.Series({'All': len(x['StudentID']),
                      'Part': sum(x['Participated'] == 'yes'),
                      'Pass': sum(x['Passed'] == 'yes'),
                      'test': x['testValue'].mean()})


d1.groupby('ExamenYear').apply(ZahlOccurence_1)


            All  Part  Pass      test
ExamenYear                           
2007          3     2     2  0.358702
2008          4     3     3  1.004504
2009          3     3     2  0.521511

I hope someone else will find this useful.

answered by user1043144


You may use the pandas crosstab function, which by default computes a frequency table of two or more variables. For example:

>>> import pandas as pd
>>> pd.crosstab(d1['ExamenYear'], d1['Passed'])
Passed      no  yes
ExamenYear         
2007         1    2
2008         1    3
2009         1    2

Use the margins=True option if you also want to see the subtotal of each row and column.

>>> pd.crosstab(d1['ExamenYear'], d1['Participated'], margins=True)
Participated  no  yes  All
ExamenYear                
2007           1    2    3
2008           1    3    4
2009           0    3    3
All            2    8   10
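
Since crosstab accepts lists of factors, more than two variables work the same way; for instance, on the sample data (a quick sketch):

>>> pd.crosstab([d1['ExamenYear'], d1['Exam']], d1['Passed'])
Passed              no  yes
ExamenYear Exam            
2007       algebra   1    0
           bio       0    1
           stats     0    1
2008       algebra   1    1
           stats     0    2
2009       algebra   0    1
           bio       1    1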
answered by Ida


This:

d1.groupby('ExamenYear').agg({'Participated': len, 
                              'Passed': lambda x: sum(x == 'yes')})

doesn't look way more awkward than the R solution, IMHO.
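
For reference, on the sample data above this yields:

            Participated  Passed
ExamenYear                      
2007                   3       2
2008                   4       3
2009                   3       2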

answered by herrfz