Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas groupby result into multiple columns

Tags:

python

pandas

I have a dataframe in which I'm looking to group and then partition the values within a group into multiple columns.

For example: say I have the following dataframe:

>>> import pandas as pd
>>> import numpy as np
>>> df=pd.DataFrame()
>>> df['Group']=['A','C','B','A','C','C']
>>> df['ID']=[1,2,3,4,5,6]
>>> df['Value']=np.random.randint(1,100,6)
>>> df
  Group  ID  Value
0     A   1     66
1     C   2      2
2     B   3     98
3     A   4     90
4     C   5     85
5     C   6     38
>>> 

I want to groupby the "Group" field, get the sum of the "Value" field, and get new fields, each of which holds the ID values of the group.

Currently I am able to do this as follows, but I am looking for a cleaner methodology:

First, I create a dataframe with a list of the IDs in each group.

>>> g=df.groupby('Group')
>>> result=g.agg({'Value':np.sum, 'ID':lambda x:x.tolist()})
>>> result
              ID  Value
Group                  
A         [1, 4]     98
B            [3]     76
C      [2, 5, 6]    204
>>> 

And then I use pd.Series to split those up into columns, rename them, and then join it back.

>>> id_df=result.ID.apply(lambda x:pd.Series(x))
>>> id_cols=['ID'+str(x) for x in range(1,len(id_df.columns)+1)]
>>> id_df.columns=id_cols
>>> 
>>> result.join(id_df)[id_cols+['Value']]
       ID1  ID2  ID3  Value
Group                      
A        1    4  NaN     98
B        3  NaN  NaN     76
C        2    5    6    204
>>> 

Is there a way to do this without first having to create the list of values?

like image 837
AJG519 Avatar asked Jan 26 '16 21:01

AJG519


People also ask

Can you use groupby with multiple columns in pandas?

Grouping by Multiple ColumnsYou can do this by passing a list of column names to groupby instead of a single string value.

How do you group by and sum multiple columns in pandas?

Use DataFrame. groupby(). sum() to group rows based on one or multiple columns and calculate sum agg function. groupby() function returns a DataFrameGroupBy object which contains an aggregate function sum() to calculate a sum of a given column for each group.

What does PD groupby return?

Returns a groupby object that contains information about the groups. Convenience method for frequency conversion and resampling of time series. See the user guide for more detailed usage and examples, including splitting an object into groups, iterating through groups, selecting a group, aggregation, and more.

Can I group by 2 columns?

Usage of Group By Multiple ColumnsAll the records with the same values for the respective columns mentioned in the grouping criteria can be grouped as a single column using the group by multiple-column technique. The group by multiple columns is used to get summarized data from a database's table(s).


2 Answers

You could use

id_df = grouped['ID'].apply(lambda x: pd.Series(x.values)).unstack()

to create id_df without the intermediate result DataFrame.


import pandas as pd
import numpy as np
np.random.seed(2016)

df = pd.DataFrame({'Group': ['A', 'C', 'B', 'A', 'C', 'C'],
                   'ID': [1, 2, 3, 4, 5, 6],
                   'Value': np.random.randint(1, 100, 6)})

grouped = df.groupby('Group')
values = grouped['Value'].agg('sum')
id_df = grouped['ID'].apply(lambda x: pd.Series(x.values)).unstack()
id_df = id_df.rename(columns={i: 'ID{}'.format(i + 1) for i in range(id_df.shape[1])})
result = pd.concat([id_df, values], axis=1)
print(result)

yields

       ID1  ID2  ID3  Value
Group                      
A        1    4  NaN     77
B        3  NaN  NaN     84
C        2    5    6     86
like image 90
unutbu Avatar answered Oct 12 '22 23:10

unutbu


Another way of doing this is to first added a "helper" column on to your data, then pivot your dataframe using the "helper" column, in the case below "ID_Count":

Using @unutbu setup:

import pandas as pd
import numpy as np
np.random.seed(2016)

df = pd.DataFrame({'Group': ['A', 'C', 'B', 'A', 'C', 'C'],
                   'ID': [1, 2, 3, 4, 5, 6],
                   'Value': np.random.randint(1, 100, 6)})
#Create group
grp = df.groupby('Group')

#Create helper column 
df['ID_Count'] = grp['ID'].cumcount() + 1

#Pivot dataframe using helper column and add 'Value' column to pivoted output.
df_out = df.pivot('Group','ID_Count','ID').add_prefix('ID').assign(Value = grp['Value'].sum())

Output:

ID_Count  ID1  ID2  ID3  Value
Group                         
A         1.0  4.0  NaN     77
B         3.0  NaN  NaN     84
C         2.0  5.0  6.0     86
like image 34
Scott Boston Avatar answered Oct 13 '22 00:10

Scott Boston