Pandas groupby result into multiple columns

Tags:

python

pandas

I have a dataframe in which I'm looking to group and then partition the values within a group into multiple columns.

For example: say I have the following dataframe:

>>> import pandas as pd
>>> import numpy as np
>>> df=pd.DataFrame()
>>> df['Group']=['A','C','B','A','C','C']
>>> df['ID']=[1,2,3,4,5,6]
>>> df['Value']=np.random.randint(1,100,6)
>>> df
  Group  ID  Value
0     A   1     66
1     C   2      2
2     B   3     98
3     A   4     90
4     C   5     85
5     C   6     38
>>>

I want to groupby the "Group" field, get the sum of the "Value" field, and get new fields, each of which holds the ID values of the group.

Currently I am able to do this as follows, but I am looking for a cleaner methodology:

First, I create a dataframe with a list of the IDs in each group.

>>> g=df.groupby('Group')
>>> result=g.agg({'Value':np.sum, 'ID':lambda x:x.tolist()})
>>> result
              ID  Value
Group                  
A         [1, 4]     98
B            [3]     76
C      [2, 5, 6]    204
>>>

And then I use pd.Series to split those up into columns, rename them, and then join it back.

>>> id_df=result.ID.apply(lambda x:pd.Series(x))
>>> id_cols=['ID'+str(x) for x in range(1,len(id_df.columns)+1)]
>>> id_df.columns=id_cols
>>> 
>>> result.join(id_df)[id_cols+['Value']]
       ID1  ID2  ID3  Value
Group                      
A        1    4  NaN     98
B        3  NaN  NaN     76
C        2    5    6    204
>>>

Is there a way to do this without first having to create the list of values?

837

asked Jan 26 '16 21:01

AJG519

2 Answers

You could use

id_df = grouped['ID'].apply(lambda x: pd.Series(x.values)).unstack()

to create id_df without the intermediate result DataFrame.

import pandas as pd
import numpy as np
np.random.seed(2016)

df = pd.DataFrame({'Group': ['A', 'C', 'B', 'A', 'C', 'C'],
                   'ID': [1, 2, 3, 4, 5, 6],
                   'Value': np.random.randint(1, 100, 6)})

grouped = df.groupby('Group')
values = grouped['Value'].agg('sum')
id_df = grouped['ID'].apply(lambda x: pd.Series(x.values)).unstack()
id_df = id_df.rename(columns={i: 'ID{}'.format(i + 1) for i in range(id_df.shape[1])})
result = pd.concat([id_df, values], axis=1)
print(result)

yields

       ID1  ID2  ID3  Value
Group                      
A        1    4  NaN     77
B        3  NaN  NaN     84
C        2    5    6     86

answered Oct 12 '22 23:10

unutbu

Another way of doing this is to first added a "helper" column on to your data, then pivot your dataframe using the "helper" column, in the case below "ID_Count":

Using @unutbu setup:

import pandas as pd
import numpy as np
np.random.seed(2016)

df = pd.DataFrame({'Group': ['A', 'C', 'B', 'A', 'C', 'C'],
                   'ID': [1, 2, 3, 4, 5, 6],
                   'Value': np.random.randint(1, 100, 6)})
#Create group
grp = df.groupby('Group')

#Create helper column 
df['ID_Count'] = grp['ID'].cumcount() + 1

#Pivot dataframe using helper column and add 'Value' column to pivoted output.
df_out = df.pivot('Group','ID_Count','ID').add_prefix('ID').assign(Value = grp['Value'].sum())

Output:

ID_Count  ID1  ID2  ID3  Value
Group                         
A         1.0  4.0  NaN     77
B         3.0  NaN  NaN     84
C         2.0  5.0  6.0     86

answered Oct 13 '22 00:10

Scott Boston

Related questions
                            
                                string formatting a sql query in sqlite3
                            
                                Slow ANTLR4 generated Parser in Python, but fast in Java
                            
                                Adjust exponent text after setting scientific limits on matplotlib axis
                            
                                perfomance of len(List) vs reading a variable
                            
                                Plot Multicolored line based on conditional in python
                            
                                celery and long running tasks
                            
                                Plotting asymmetric error bars for a single point using errorbar
                            
                                PyQt5 and asyncio: yield from never finishes
                            
                                How to shift a graph along the x-axis?
                            
                                How to make a .pyc file from Python script [duplicate]
                            
                                Iterate through python string starting at index
                            
                                How to add topic filters when calling recv_pyobj() in ZeroMQ?
                            
                                np.where Not Working in my Pandas
                            
                                Manually-defined axis labels for Matplotlib imshow()
                            
                                spark-submit EMR Step failing when submitted using boto3 client
                            
                                'importlib._bootstrap' has no attribute 'SourceLoader'
                            
                                Scikit-Learn: Label not x is present in all training examples
                            
                                Access contents of list after applying Counter from collections module
                            
                                Using eventlet to manage socketio in Flask
                            
                                Is it possible to know the maximum number accepted by chr using Python?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With