Outputting a DataFrame instead of Series from a multiple return value groupby-apply operation

Question

Edit:

I need the applied function to return several values from several complex calculations. I can return those values in a tuple, in which case the result of the groupby-apply operation is a Series with the group names as the index and the tuples as values. I would like it to return a DataFrame instead, so I can keep all the pandas functionality and flexibility.

In general, the result of a groupby-apply operation is a Series when the applied function returns a single value. When the applied function returns two or more values, I would like the result to be a DataFrame, so my question is how to do that. See the original question below for more details and an example.

Original Q:

I have a DataFrame that contains many columns and groups. I am trying to do a group-wise operation via the groupby-apply mechanism and retrieve only 2 values for each group. Currently, I'm returning a tuple for each group (e.g. return (a, b)), so the result I get is a Series with the group names as the index and tuples as values.

This is not the best output for me, because I next need to sort by one of these values, and in general I lose much of the DataFrame and Series functionality this way.

What I would like to get back instead is a DataFrame with columns 'a' and 'b'.

For example, say I have a large DataFrame df that looks something like this:

Out[123]:
            ID1         ID2  score
0  6073165338_1  6073165338    100
1  6073165338_1  6073165338     89
2  6073165338_1  6073165338     87
3  6073165338_1  6073165338     65
4  6073165338_1  6073165338     62

I would like to group it by ID1 and return, for each group, the ID2 (which is the same within each ID1 group) and the average score of the first 3 entries. I can do something like this:

def calc(grp):
    # Note: [:2] averages the first two scores (hence 94.5 below); use [:3] for the first three.
    return grp.ID2.iloc[0], grp.score.iloc[:2].mean()

The result of df.groupby('ID1').apply(calc) would be a Series with the ID1 groups as the index and 2-element tuples as values:

6073165338_1 (6073165338, 94.5)

I want the output to be a DataFrame with the same index and the two values as columns, so that I can easily continue the analysis.

How do I do that?
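For reference, the situation described above can be reproduced with the five sample rows shown earlier. This is only a sketch: it assumes the columns are selected explicitly before apply (so the grouping column is not passed to calc, which newer pandas versions warn about), and it keeps the [:2] slice from the code above.

```python
import pandas as pd

# The five sample rows from the question.
df = pd.DataFrame({
    "ID1": ["6073165338_1"] * 5,
    "ID2": [6073165338] * 5,
    "score": [100, 89, 87, 65, 62],
})

def calc(grp):
    # Same logic as in the question: first ID2 and the mean of the first two scores.
    return grp["ID2"].iloc[0], grp["score"].iloc[:2].mean()

# Selecting the columns explicitly keeps the grouping column out of each group.
s = df.groupby("ID1")[["ID2", "score"]].apply(calc)

print(type(s).__name__)  # Series
print(s.iloc[0])         # a 2-tuple holding the ID2 value and 94.5
```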

idoda · Accepted Answer

OK, I have two solutions for this. The first one is probably better, but I would still appreciate a comment from the experts. The first option is to have the applied function return a tuple and then convert the resulting Series of tuples to a DataFrame:

from pandas import DataFrame

s = df.groupby('ID1').apply(calc)
DataFrame(s.tolist(), index=s.index, columns=['ID2', 'top3avg'])

This results in:

Out[156]:
                     ID2  top3avg
ID1
6073165338_1  6073165338     94.5
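A self-contained sketch of this first option, followed by the sort that motivated the question, might look like the following. The second group ("hypothetical_2") and the constant N are made up for illustration and are not part of the original post; N = 2 mirrors the [:2] slice in calc.

```python
import pandas as pd

N = 2  # the posted code slices [:2]; set to 3 for a true top-three average

# First five rows are from the question; the second group is made up for illustration.
df = pd.DataFrame({
    "ID1": ["6073165338_1"] * 5 + ["hypothetical_2"] * 3,
    "ID2": [6073165338] * 5 + [1234567890] * 3,
    "score": [100, 89, 87, 65, 62, 70, 60, 50],
})

def calc(grp):
    # Return both values as a tuple, one tuple per group.
    return grp["ID2"].iloc[0], grp["score"].iloc[:N].mean()

s = df.groupby("ID1")[["ID2", "score"]].apply(calc)

# Expand the Series of tuples into columns, reusing the group index.
result = pd.DataFrame(s.tolist(), index=s.index, columns=["ID2", "top3avg"])

# Ordinary DataFrame operations work again, e.g. sorting by one of the values.
ranked = result.sort_values("top3avg", ascending=False)
print(ranked)
```

The DataFrame constructor is called once on the collected tuples, so the cost of building the frame does not grow with a per-group constructor call.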

The second option is to have the applied function return a DataFrame, built by calling the DataFrame constructor on the tuple:

def calc(grp):
    # One-row DataFrame per group; [:2] averages the first two scores.
    return DataFrame([(grp.ID2.iloc[0], grp.score.iloc[:2].mean())], columns=['ID2', 'top3avg'])

The result of df.groupby('ID1').apply(calc) is now a DataFrame:

                       ID2  top3avg
ID1
6073165338_1 0  6073165338     94.5
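If this second option is used anyway, the extra integer level can be removed after the fact. The sketch below reuses the five sample rows from the question; the droplevel call is an addition that does not appear in the original answer.

```python
import pandas as pd

# The five sample rows from the question.
df = pd.DataFrame({
    "ID1": ["6073165338_1"] * 5,
    "ID2": [6073165338] * 5,
    "score": [100, 89, 87, 65, 62],
})

def calc(grp):
    # One-row DataFrame per group, as in the second option.
    return pd.DataFrame(
        [(grp["ID2"].iloc[0], grp["score"].iloc[:2].mean())],
        columns=["ID2", "top3avg"],
    )

# The result carries a two-level index: the group name and the inner row label 0.
nested = df.groupby("ID1")[["ID2", "score"]].apply(calc)

# Drop the inner integer level so that only the group name remains as index.
flat = nested.droplevel(1)
print(flat)
```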

The first option seems better because:

It runs the DataFrame constructor only once, at the end of the groupby-apply operation.
It does not produce the unnecessary integer index level.
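A third route, not mentioned in the answer above, is to return a labelled pandas Series from the applied function; groupby-apply then turns the labels into columns and yields a DataFrame directly. A minimal sketch, assuming the same sample data and the [:2] slice used above:

```python
import pandas as pd

# The five sample rows from the question.
df = pd.DataFrame({
    "ID1": ["6073165338_1"] * 5,
    "ID2": [6073165338] * 5,
    "score": [100, 89, 87, 65, 62],
})

def calc(grp):
    # The Series labels become the column names of the result.
    return pd.Series(
        {"ID2": grp["ID2"].iloc[0], "top3avg": grp["score"].iloc[:2].mean()}
    )

result = df.groupby("ID1")[["ID2", "score"]].apply(calc)

# Mixing an integer and a float in one Series upcasts both to float,
# so restore the integer dtype of ID2 afterwards.
result["ID2"] = result["ID2"].astype("int64")
print(result)
```

This builds one Series per group, so with many groups it may be slower than constructing a single DataFrame at the end, as in the first option.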

Tags:

python

pandas
