Edit:
I need the applied function to return several values from several complex calculations. I can return those values in a tuple, in which case the outcome of the groupby-apply operation is a Series with the group names as the index and the tuples as values. I would like it to return a DataFrame instead, so I can keep all the pandas functionality and flexibility.
In general, the outcome of a groupby-apply operation is a Series when the applied function returns one value. When the applied function returns two or more values, I would like the outcome to be a DataFrame. My question is how to do that. See the original question below for more details and examples.
Original Q:
I have a DataFrame which contains many columns and groups. I am trying to do a group-wise operation via the groupby-apply mechanism and retrieve only 2 values for each group. Currently, I'm returning a tuple for each group (e.g. return (a,b)), so the result I'm getting is a Series with the group names as the index and tuples as values.
This is not the best output for me, as I next need to sort by one of these values, and in general I'm losing much of the DataFrame and Series functionality this way.
What I would like to get back instead is a DataFrame with columns 'a' and 'b'.
For example, say I have a large DataFrame df that looks something like this:
Out[123]:
            ID1         ID2  score
0  6073165338_1  6073165338    100
1  6073165338_1  6073165338     89
2  6073165338_1  6073165338     87
3  6073165338_1  6073165338     65
4  6073165338_1  6073165338     62
I would like to group it by ID1 and return, for each group, the ID2 (which is the same within each ID1 group) and the average score of the first few entries (the function below averages the first two, score[:2]). I can do something like this:
def calc(grp):
    # First ID2 of the group and the mean of the first two scores
    return grp.ID2.iloc[0], grp.score.iloc[:2].mean()
The result of df.groupby('ID1').apply(calc) would be a Series with the ID1 groups as the index and a tuple with 2 elements as each value:
6073165338_1 (6073165338, 94.5)
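The behaviour described above can be reproduced with a small self-contained sketch. It rebuilds only the five rows shown in the example and uses `.iloc` for positional slicing. The explicit column selection after `groupby` is my own addition, meant to keep recent pandas releases from warning about the grouping column. It is not part of the original call.

```python
import pandas as pd

# The five rows shown in the question.
df = pd.DataFrame({
    "ID1": ["6073165338_1"] * 5,
    "ID2": [6073165338] * 5,
    "score": [100, 89, 87, 65, 62],
})

def calc(grp):
    # First ID2 of the group and the mean of the first two scores.
    return grp["ID2"].iloc[0], grp["score"].iloc[:2].mean()

# Selecting the needed columns explicitly is an assumption of this sketch;
# the original call is simply df.groupby('ID1').apply(calc).
result = df.groupby("ID1")[["ID2", "score"]].apply(calc)

print(type(result).__name__)  # Series
print(result)
```

Each value of `result` is a plain Python tuple, which is why column-wise operations such as sorting by the average are awkward at this point.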
I want the output to be a DataFrame with the same index and the two values as columns, so that I can easily continue the analysis.
How do I do that?
OK, I have two solutions for this. The first one is probably better, but I would still appreciate a comment from the experts. The first option is to have the applied function return a tuple and then convert the Series of tuples to a DataFrame:
from pandas import DataFrame
s = x.groupby('ID1').apply(calc)  # x is the DataFrame from the question
DataFrame(s.tolist(), index=s.index, columns=['ID2', 'top3avg'])
This results in:
Out[156]:
                     ID2  top3avg
ID1
6073165338_1  6073165338     94.5
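As a runnable sketch of this first option: the example below expands the Series of tuples into columns and then sorts by one of them, which is the follow-up step the question asks for. The second group (`hypothetical_2`, with its ID2 and scores) is made up purely for illustration so that sorting has something to reorder. The explicit column selection after `groupby` is likewise an assumption of the sketch.

```python
import pandas as pd

df = pd.DataFrame({
    # The first five rows come from the question; the "hypothetical_2"
    # group is invented for illustration only.
    "ID1": ["6073165338_1"] * 5 + ["hypothetical_2"] * 3,
    "ID2": [6073165338] * 5 + [1111111111] * 3,
    "score": [100, 89, 87, 65, 62, 70, 60, 50],
})

def calc(grp):
    # First ID2 of the group and the mean of the first two scores.
    return grp["ID2"].iloc[0], grp["score"].iloc[:2].mean()

# Series of tuples, indexed by ID1.
s = df.groupby("ID1")[["ID2", "score"]].apply(calc)

# Expand each tuple into its own column, keeping the group labels as the index.
out = pd.DataFrame(s.tolist(), index=s.index, columns=["ID2", "top3avg"])

# The result is an ordinary DataFrame, so sorting by a value works as usual.
print(out.sort_values("top3avg", ascending=False))
```

Because the index of `s` is passed straight through, the group labels survive the conversion and no extra index level is introduced.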
The second option is to return a DataFrame by calling the DataFrame constructor on the returned tuple:
def calc(grp):
    # Wrap the tuple in a one-row DataFrame with named columns
    return DataFrame([(grp.ID2.iloc[0], grp.score.iloc[:2].mean())], columns=['ID2', 'top3avg'])
The result of x.groupby('ID1').apply(calc) is now a DataFrame, although it carries an extra inner index level (the 0 in the output):
                       ID2  top3avg
ID1
6073165338_1 0  6073165338     94.5
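A sketch of this second option, for comparison. Each group returns a one-row DataFrame, so pandas stacks the pieces under a two-level index: the group key, then the row label of the piece. One way to get back to a flat ID1 index is `droplevel`, which is my suggestion and not part of the original answer. The `hypothetical_2` group and the explicit column selection are again assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    # The first five rows come from the question; the "hypothetical_2"
    # group is invented for illustration only.
    "ID1": ["6073165338_1"] * 5 + ["hypothetical_2"] * 3,
    "ID2": [6073165338] * 5 + [1111111111] * 3,
    "score": [100, 89, 87, 65, 62, 70, 60, 50],
})

def calc(grp):
    # Return a one-row DataFrame instead of a bare tuple.
    return pd.DataFrame(
        [(grp["ID2"].iloc[0], grp["score"].iloc[:2].mean())],
        columns=["ID2", "top3avg"],
    )

nested = df.groupby("ID1")[["ID2", "score"]].apply(calc)
print(nested.index.nlevels)  # 2

# Drop the inner level (the row label 0 of each one-row piece).
flat = nested.droplevel(-1)
print(flat)
```

After `droplevel`, `flat` has the same shape as the result of the first option, at the cost of building one small DataFrame per group.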
The first option seems better since: