
Pandas : Assign result of groupby to dataframe to a new column

I have the following toy dataframe (the real one has 500k rows):

import pandas as pd

df = pd.DataFrame({'size': list('SSMMMLS'),
                   'weight': [8, 10, 11, 1, 20, 14, 12],
                   'adult' : [False] * 5 + [True] * 2})

   adult size  weight
0  False    S       8
1  False    S      10
2  False    M      11
3  False    M       1
4  False    M      20
5   True    L      14
6   True    S      12

I want to group by adult, select the row in each group whose weight is maximal, and assign that row's size value to a new column size2.

In other words, I want a column size2 holding the size value of the max-weight row, propagated to every row of the same adult group. So all adult = False rows will have value S because the adult=False max weight is 20.

   adult size size2  weight
0  False    S     S       8
1  False    S     S      10
2  False    M     S      11
3  False    M     S       1
4  False    M     S      20
5   True    L     L      14
6   True    S     L      12

I found this but it doesn't work for me

So far I have :

df.loc[:, 'size2'] = (df.groupby('adult',as_index=True)['weight','size']
                        .transform(lambda x: x.ix[x['weight'].idxmax()]['size']))
Gilles Cuyaubere asked Mar 03 '16 19:03



3 Answers

Just a more detailed version of @jezrael's answer, using your dataframe:

df = pd.DataFrame({'size': list('SSMMMLS'),
                   'weight': [8, 10, 11, 1, 20, 14, 12],
                   'adult' : [False] * 5 + [True] * 2})
#    adult size  weight
# 0  False    S       8
# 1  False    S      10
# 2  False    M      11
# 3  False    M       1
# 4  False    M      20
# 5   True    L      14
# 6   True    S      12

To get the size value of the max-weight row:

def size4max_weight(subf):
    """ Return size value for the max weight line """
    return subf['size'][subf['weight'].idxmax()]
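
As a quick sanity check, the helper can be called directly on the whole dataframe, or on one group's rows, before handing it to groupby. This is only a minimal sketch on the toy data; the variable names whole and adults are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'size': list('SSMMMLS'),
                   'weight': [8, 10, 11, 1, 20, 14, 12],
                   'adult': [False] * 5 + [True] * 2})

def size4max_weight(subf):
    """Return the size value of the row with the maximal weight."""
    return subf['size'][subf['weight'].idxmax()]

# Whole frame: the max weight (20) sits at index 4, whose size is 'M'
whole = size4max_weight(df)

# Adult rows only: the max weight (14) sits at index 5, whose size is 'L'
adults = size4max_weight(df[df['adult']])

print(whole, adults)  # M L
```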

A groupby on 'adult' produces a Series with False and True as index values:

>>> size2_col = df.groupby('adult').apply(size4max_weight)
>>> type(size2_col), size2_col.index
(pandas.core.series.Series, Index([False, True], dtype='object', name=u'adult'))

With reset_index we convert the Series into a DataFrame:

>>> size2_col = df.groupby('adult').apply(size4max_weight).reset_index(name='size2')
>>> size2_col
   adult size2
0  False     M
1   True     L
>>>

A pd.merge on 'adult' then gives the result:

>>> pd.merge(df, size2_col, on=['adult'])
   adult size  weight size2
0  False    S       8     M
1  False    S      10     M
2  False    M      11     M
3  False    M       1     M
4  False    M      20     M
5   True    L      14     L
6   True    S      12     L
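
If the extra merge feels heavy, the per-group value can probably be broadcast with Series.map instead, since the lookup is keyed by the 'adult' values. This is only a sketch of an alternative, not the method shown above; idx and size_by_adult are hypothetical names:

```python
import pandas as pd

df = pd.DataFrame({'size': list('SSMMMLS'),
                   'weight': [8, 10, 11, 1, 20, 14, 12],
                   'adult': [False] * 5 + [True] * 2})

# Row label of the max-weight row in each 'adult' group
# (index: the adult values, values: row labels of df)
idx = df.groupby('adult')['weight'].idxmax()

# Size found at those rows, keyed by the group value instead of the row label
size_by_adult = pd.Series(df.loc[idx.to_numpy(), 'size'].to_numpy(),
                          index=idx.index)

# Look up each row's group key to fill the new column
df['size2'] = df['adult'].map(size_by_adult)

print(df)
```

Unlike pd.merge, which returns a new frame, this assigns the column on df in place and leaves its index untouched.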
user3313834 answered Oct 18 '22 20:10


You could use transform with loc and values:

>>> df["size2"] = df["size"].loc[df.groupby("adult")["weight"].transform("idxmax")].values
>>> df
   adult size  weight size2
0  False    S       8     M
1  False    S      10     M
2  False    M      11     M
3  False    M       1     M
4  False    M      20     M
5   True    L      14     L
6   True    S      12     L

Step by step, first we find the appropriate indices:

>>> df.groupby("adult")["weight"].transform("idxmax")
0    4
1    4
2    4
3    4
4    4
5    5
6    5
dtype: int64

Then we use these to index into the size column with loc:

>>> df["size"].loc[df.groupby("adult")["weight"].transform("idxmax")]
4    M
4    M
4    M
4    M
4    M
5    L
5    L
Name: size, dtype: object

And finally we take .values so that the indices don't get in the way when we try to assign:

>>> df["size"].loc[df.groupby("adult")["weight"].transform("idxmax")].values
array(['M', 'M', 'M', 'M', 'M', 'L', 'L'], dtype=object)

>>> df["size2"] = df["size"].loc[df.groupby("adult")["weight"].transform("idxmax")].values

>>> df
   adult size  weight size2
0  False    S       8     M
1  False    S      10     M
2  False    M      11     M
3  False    M       1     M
4  False    M      20     M
5   True    L      14     L
6   True    S      12     L
>>> 
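
To make the last step concrete, here is a small sketch (same toy data, with the hypothetical names idx and picked) showing why the index gets in the way: the selected Series is labelled 4, 4, 4, 4, 4, 5, 5 rather than 0..6, so it does not line up with df by label. It also uses .to_numpy(), which newer pandas versions offer as an alternative to .values:

```python
import pandas as pd

df = pd.DataFrame({"size": list("SSMMMLS"),
                   "weight": [8, 10, 11, 1, 20, 14, 12],
                   "adult": [False] * 5 + [True] * 2})

# For every row, the label of the max-weight row of its group
idx = df.groupby("adult")["weight"].transform("idxmax")

# Selecting by those labels repeats labels 4 and 5 in the result's index
picked = df["size"].loc[idx]
print(picked.index.tolist())    # [4, 4, 4, 4, 4, 5, 5]
print(picked.index.is_unique)   # False

# Dropping the index gives a plain array that is assigned by position
df["size2"] = picked.to_numpy()
```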
DSM answered Oct 18 '22 20:10


IIUC you can use merge. I think the first value in size2 should be M, because the max weight for adult=False is 20 and that row has size M.

df = pd.DataFrame({'size': list('SSMMMLS'),
                   'weight': [8, 10, 11, 1, 20, 14, 12],
                   'adult' : [False] * 5 + [True] * 2})

print(df)
   adult size  weight
0  False    S       8
1  False    S      10
2  False    M      11
3  False    M       1
4  False    M      20
5   True    L      14
6   True    S      12

print(
    df.groupby('adult')
      .apply(lambda subf: subf['size'][subf['weight'].idxmax()])
      .reset_index(name='size2')
)
   adult size2
0  False     M
1   True     L

print(
    pd.merge(df,
             df.groupby('adult')
               .apply(lambda subf: subf['size'][subf['weight'].idxmax()])
               .reset_index(name='size2'),
             on=['adult'])
)
   adult size  weight size2
0  False    S       8     M
1  False    S      10     M
2  False    M      11     M
3  False    M       1     M
4  False    M      20     M
5   True    L      14     L
6   True    S      12     L
jezrael answered Oct 18 '22 21:10