I have a dataframe in which I'm looking to group and then partition the values within a group into multiple columns.
For example: say I have the following dataframe:
>>> import pandas as pd
>>> import numpy as np
>>> df=pd.DataFrame()
>>> df['Group']=['A','C','B','A','C','C']
>>> df['ID']=[1,2,3,4,5,6]
>>> df['Value']=np.random.randint(1,100,6)
>>> df
Group ID Value
0 A 1 66
1 C 2 2
2 B 3 98
3 A 4 90
4 C 5 85
5 C 6 38
>>>
I want to groupby the "Group" field, get the sum of the "Value" field, and get new fields, each of which holds the ID values of the group.
Currently I am able to do this as follows, but I am looking for a cleaner methodology:
First, I create a dataframe with a list of the IDs in each group.
>>> g=df.groupby('Group')
>>> result=g.agg({'Value':np.sum, 'ID':lambda x:x.tolist()})
>>> result
ID Value
Group
A [1, 4] 98
B [3] 76
C [2, 5, 6] 204
>>>
And then I use pd.Series to split those up into columns, rename them, and then join it back.
>>> id_df=result.ID.apply(lambda x:pd.Series(x))
>>> id_cols=['ID'+str(x) for x in range(1,len(id_df.columns)+1)]
>>> id_df.columns=id_cols
>>>
>>> result.join(id_df)[id_cols+['Value']]
ID1 ID2 ID3 Value
Group
A 1 4 NaN 98
B 3 NaN NaN 76
C 2 5 6 204
>>>
Is there a way to do this without first having to create the list of values?
Grouping by Multiple ColumnsYou can do this by passing a list of column names to groupby instead of a single string value.
Use DataFrame. groupby(). sum() to group rows based on one or multiple columns and calculate sum agg function. groupby() function returns a DataFrameGroupBy object which contains an aggregate function sum() to calculate a sum of a given column for each group.
Returns a groupby object that contains information about the groups. Convenience method for frequency conversion and resampling of time series. See the user guide for more detailed usage and examples, including splitting an object into groups, iterating through groups, selecting a group, aggregation, and more.
Usage of Group By Multiple ColumnsAll the records with the same values for the respective columns mentioned in the grouping criteria can be grouped as a single column using the group by multiple-column technique. The group by multiple columns is used to get summarized data from a database's table(s).
You could use
id_df = grouped['ID'].apply(lambda x: pd.Series(x.values)).unstack()
to create id_df
without the intermediate result
DataFrame.
import pandas as pd
import numpy as np
np.random.seed(2016)
df = pd.DataFrame({'Group': ['A', 'C', 'B', 'A', 'C', 'C'],
'ID': [1, 2, 3, 4, 5, 6],
'Value': np.random.randint(1, 100, 6)})
grouped = df.groupby('Group')
values = grouped['Value'].agg('sum')
id_df = grouped['ID'].apply(lambda x: pd.Series(x.values)).unstack()
id_df = id_df.rename(columns={i: 'ID{}'.format(i + 1) for i in range(id_df.shape[1])})
result = pd.concat([id_df, values], axis=1)
print(result)
yields
ID1 ID2 ID3 Value
Group
A 1 4 NaN 77
B 3 NaN NaN 84
C 2 5 6 86
Another way of doing this is to first added a "helper" column on to your data, then pivot your dataframe using the "helper" column, in the case below "ID_Count":
Using @unutbu setup:
import pandas as pd
import numpy as np
np.random.seed(2016)
df = pd.DataFrame({'Group': ['A', 'C', 'B', 'A', 'C', 'C'],
'ID': [1, 2, 3, 4, 5, 6],
'Value': np.random.randint(1, 100, 6)})
#Create group
grp = df.groupby('Group')
#Create helper column
df['ID_Count'] = grp['ID'].cumcount() + 1
#Pivot dataframe using helper column and add 'Value' column to pivoted output.
df_out = df.pivot('Group','ID_Count','ID').add_prefix('ID').assign(Value = grp['Value'].sum())
Output:
ID_Count ID1 ID2 ID3 Value
Group
A 1.0 4.0 NaN 77
B 3.0 NaN NaN 84
C 2.0 5.0 6.0 86
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With