I'm trying to wrap my head around Pandas groupby methods. I'd like to write a function that does some aggregation functions and then returns a Pandas DataFrame. Here's a grossly simplified example using sum(). I know there are easier ways to do simple sums, in real life my function is more complex:
import pandas as pd
df = pd.DataFrame({'col1': ['A', 'A', 'B', 'B'], 'col2':[1.0, 2, 3, 4]})
In [3]: df
Out[3]:
col1 col2
0 A 1
1 A 2
2 B 3
3 B 4
def func2(df):
dfout = pd.DataFrame({ 'col1' : df['col1'].unique() ,
'someData': sum(df['col2']) })
return dfout
t = df.groupby('col1').apply(func2)
In [6]: t
Out[6]:
col1 someData
col1
A 0 A 3
B 0 B 7
I did not expect to have col1
in there twice nor did I expect that mystery index looking thing. I really thought I would just get col1
& someData
.
In my real life application I'm grouping by more than one column and really would like to get back a DataFrame and not a Series object.
Any ideas for a solution or an explanation on what Pandas is doing in my example above?
----- added info -----
I should have started with this example, I think:
In [13]: import pandas as pd
In [14]: df = pd.DataFrame({'col1':['A','A','A','B','B','B'], 'col2':['C','D','D','D','C','C'], 'col3':[.1,.2,.4,.6,.8,1]})
In [15]: df
Out[15]:
col1 col2 col3
0 A C 0.1
1 A D 0.2
2 A D 0.4
3 B D 0.6
4 B C 0.8
5 B C 1.0
In [16]: def func3(df):
....: dfout = sum(df['col3']**2)
....: return dfout
....:
In [17]: t = df.groupby(['col1', 'col2']).apply(func3)
In [18]: t
Out[18]:
col1 col2
A C 0.01
D 0.20
B C 1.64
D 0.36
In the above illustration the result of the apply()
function is a Pandas Series. And it lacks the groupby columns from the df.groupby
. The essence of what I'm struggling with is how do I create a function which I apply to a groupby which returns both the result of the function AND the columns on which it was grouped?
----- yet another update ------
It appears that if I then do this:
pd.DataFrame(t).reset_index()
I get back a dataframe which is really close to what I was after.
You can group DataFrame rows into a list by using pandas. DataFrame. groupby() function on the column of interest, select the column you want as a list from group and then use Series. apply(list) to get the list for every group.
GROUP BY returns a single row for each unique combination of the GROUP BY fields. So in your example, every distinct combination of (a1, a2) occurring in rows of Tab1 results in a row in the query representing the group of rows with the given combination of group by field values .
The reason you are seeing the columns with 0s is because the output of .unique()
is an array.
The best way to understand how your apply is going to work is to inspect each action group-wise:
In [11] :g = df.groupby('col1')
In [12]: g.get_group('A')
Out[12]:
col1 col2
0 A 1
1 A 2
In [13]: g.get_group('A')['col1'].unique()
Out[13]: array([A], dtype=object)
In [14]: sum(g.get_group('A')['col2'])
Out[14]: 3.0
The majority of the time you want this to be an aggregated value.
The output of grouped.apply
will always have the group labels as an index (the unique values of 'col1'), so your example construction of col1
seems a little obtuse to me.
Note: To pop 'col1'
(the index) back to a column you can call reset_index
, so in this case.
In [15]: g.sum().reset_index()
Out[15]:
col1 col2
0 A 3
1 B 7
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With