
Converting a Pandas GroupBy output from Series to DataFrame

I'm starting with input data like this:

    import pandas

    df1 = pandas.DataFrame({
        "Name": ["Alice", "Bob", "Mallory", "Mallory", "Bob", "Mallory"],
        "City": ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"],
    })

Which when printed appears as this:

           City     Name
    0   Seattle    Alice
    1   Seattle      Bob
    2  Portland  Mallory
    3   Seattle  Mallory
    4   Seattle      Bob
    5  Portland  Mallory

Grouping is simple enough:

    g1 = df1.groupby(["Name", "City"]).count()

and printing yields a GroupBy object:

                      City  Name
    Name    City
    Alice   Seattle      1     1
    Bob     Seattle      2     2
    Mallory Portland     2     2
            Seattle      1     1

But what I want eventually is another DataFrame object that contains all the rows in the GroupBy object. In other words I want to get the following result:

                      City  Name
    Name    City
    Alice   Seattle      1     1
    Bob     Seattle      2     2
    Mallory Portland     2     2
    Mallory Seattle      1     1

I can't quite see how to accomplish this in the pandas documentation. Any hints would be welcome.

saveenr asked Apr 29 '12 16:04



2 Answers

g1 here is a DataFrame. It has a hierarchical index, though:

    In [19]: type(g1)
    Out[19]: pandas.core.frame.DataFrame

    In [20]: g1.index
    Out[20]:
    MultiIndex([('Alice', 'Seattle'), ('Bob', 'Seattle'), ('Mallory', 'Portland'),
           ('Mallory', 'Seattle')], dtype=object)

Perhaps you want something like this?

    In [21]: g1.add_suffix('_Count').reset_index()
    Out[21]:
          Name      City  City_Count  Name_Count
    0    Alice   Seattle           1           1
    1      Bob   Seattle           2           2
    2  Mallory  Portland           2           2
    3  Mallory   Seattle           1           1

Or something like:

    In [36]: DataFrame({'count' : df1.groupby( [ "Name", "City"] ).size()}).reset_index()
    Out[36]:
          Name      City  count
    0    Alice   Seattle      1
    1      Bob   Seattle      2
    2  Mallory  Portland      2
    3  Mallory   Seattle      1
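
As a side note, the dict wrapper is probably not needed on more recent pandas versions. The following is a minimal sketch, not part of the original answer, assuming a pandas release where `Series.to_frame` accepts a `name` argument; the column name `count` is only an illustrative choice.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Name": ["Alice", "Bob", "Mallory", "Mallory", "Bob", "Mallory"],
    "City": ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"],
})

# size() returns a Series indexed by a (Name, City) MultiIndex.
sizes = df1.groupby(["Name", "City"]).size()

# to_frame() turns the Series into a one-column DataFrame;
# reset_index() then moves the index levels back into ordinary columns.
counts = sizes.to_frame(name="count").reset_index()
print(counts)
```

The result should have the three flat columns `Name`, `City` and `count`, with one row per distinct pair.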

Wes McKinney answered Sep 22 '22 02:09


I'd like to adjust Wes's answer slightly, because pandas version 0.16.2 requires as_index=False. If you don't set it, you get an empty DataFrame.

Source:

Aggregation functions will not return the groups that you are aggregating over if they are named columns, when as_index=True, the default. The grouped columns will be the indices of the returned object.

Passing as_index=False will return the groups that you are aggregating over, if they are named columns.

Aggregating functions are ones that reduce the dimension of the returned objects, for example: mean, sum, size, count, std, var, sem, describe, first, last, nth, min, max. This is what happens when you do for example DataFrame.sum() and get back a Series.

nth can act as a reducer or a filter, see here.
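
To make the quoted behaviour concrete, here is a minimal sketch of the two settings side by side. It is an illustration rather than part of the quoted documentation, and it assumes a reasonably recent pandas; the `Visits` column is hypothetical and was added only so that there is something to aggregate besides the group keys.

```python
import pandas as pd

# "Visits" is a hypothetical value column added only for illustration.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Mallory", "Mallory", "Bob", "Mallory"],
    "City": ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"],
    "Visits": [1, 2, 3, 4, 5, 6],
})

# Default (as_index=True): the group keys become a MultiIndex.
indexed = df.groupby(["Name", "City"]).sum()

# as_index=False: the group keys stay as ordinary columns.
flat = df.groupby(["Name", "City"], as_index=False).sum()

print(list(indexed.index.names))  # the keys live in the index
print(list(flat.columns))         # the keys are regular columns
```

With `as_index=False` the result is already a flat DataFrame, so no `reset_index()` call is needed afterwards.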

    import pandas as pd

    df1 = pd.DataFrame({"Name": ["Alice", "Bob", "Mallory", "Mallory", "Bob", "Mallory"],
                        "City": ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"]})
    print(df1)
    #
    #       City     Name
    #0   Seattle    Alice
    #1   Seattle      Bob
    #2  Portland  Mallory
    #3   Seattle  Mallory
    #4   Seattle      Bob
    #5  Portland  Mallory
    #
    g1 = df1.groupby(["Name", "City"], as_index=False).count()
    print(g1)
    #
    #                  City  Name
    #Name    City
    #Alice   Seattle      1     1
    #Bob     Seattle      2     2
    #Mallory Portland     2     2
    #        Seattle      1     1
    #

EDIT:

In version 0.17.1 and later, you can select a subset of columns before calling count, or call size and pass the name parameter to reset_index:

    print(df1.groupby(["Name", "City"], as_index=False).count())
    #IndexError: list index out of range

    print(df1.groupby(["Name", "City"]).count())
    #Empty DataFrame
    #Columns: []
    #Index: [(Alice, Seattle), (Bob, Seattle), (Mallory, Portland), (Mallory, Seattle)]

    print(df1.groupby(["Name", "City"])[['Name','City']].count())
    #                  Name  City
    #Name    City
    #Alice   Seattle      1     1
    #Bob     Seattle      2     2
    #Mallory Portland     2     2
    #        Seattle      1     1

    print(df1.groupby(["Name", "City"]).size().reset_index(name='count'))
    #      Name      City  count
    #0    Alice   Seattle      1
    #1      Bob   Seattle      2
    #2  Mallory  Portland      2
    #3  Mallory   Seattle      1

The difference between count and size is that size counts NaN values while count does not.
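
A small sketch of that difference, assuming a recent pandas with NumPy available. The `Score` column and its values are hypothetical and chosen so that one group contains a missing value.

```python
import numpy as np
import pandas as pd

# "Score" is a hypothetical column with one missing value in Bob's group.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Bob"],
    "Score": [1.0, np.nan, 3.0],
})

grouped = df.groupby("Name")["Score"]

sizes = grouped.size()    # rows per group, NaN included
counts = grouped.count()  # non-NaN values per group

print(sizes)
print(counts)
```

Here Bob's group has two rows but only one non-missing score, so the two results differ for that group and agree for Alice.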

jezrael answered Sep 21 '22 02:09