Pandas : Assign result of groupby to dataframe to a new column

Tags: python, pandas, dataframe, group-by

I have the following toy dataframe (the real one has 500k rows):

df = pd.DataFrame({'size': list('SSMMMLS'),
                   'weight': [8, 10, 11, 1, 20, 14, 12],
                   'adult' : [False] * 5 + [True] * 2})

   adult size  weight
0  False    S       8
1  False    S      10
2  False    M      11
3  False    M       1
4  False    M      20
5   True    L      14
6   True    S      12

I want to group by adult, select the row for which weight is maximal in each group, and assign that row's size value to a new column size2.

In other words, we want a column size2 holding the size value of the row with the max weight, propagated to every row of the same adult group. So all adult = False rows will have value S because the adult=False max weight is 20.

   adult size size2  weight
0  False    S     S       8
1  False    S     S      10
2  False    M     S      11
3  False    M     S       1
4  False    M     S      20
5   True    L     L      14
6   True    S     L      12

I found this but it doesn't work for me

So far I have:

df.loc[:, 'size2'] = (df.groupby('adult',as_index=True)['weight','size']
                        .transform(lambda x: x.ix[x['weight'].idxmax()]['size']))

asked Mar 03 '16 19:03

Gilles Cuyaubere

3 Answers

Just a more detailed version of @jezrael's answer, with your dataframe:

df = pd.DataFrame({'size': list('SSMMMLS'),
                   'weight': [8, 10, 11, 1, 20, 14, 12],
                   'adult' : [False] * 5 + [True] * 2})
#    adult size  weight
# 0  False    S       8
# 1  False    S      10
# 2  False    M      11
# 3  False    M       1
# 4  False    M      20
# 5   True    L      14
# 6   True    S      12

To get size value for the max weight line:

def size4max_weight(subf):
    """ Return size value for the max weight line """
    return subf['size'][subf['weight'].idxmax()]

A groupby on 'adult' will produce a Series with False and True as index values:

>>> size2_col = df.groupby('adult').apply(size4max_weight)
>>> type(size2_col), size2_col.index
(pandas.core.series.Series, Index([False, True], dtype='object', name=u'adult'))

With reset_index we convert the Series into a DataFrame:

>>> size2_col = df.groupby('adult').apply(size4max_weight).reset_index(name='size2')
>>> size2_col
   adult size2
0  False     M
1   True     L
>>>

A pd.merge on 'adult' then does the job:

>>> pd.merge(df, size2_col, on=['adult'])
   adult size  weight size2
0  False    S       8     M
1  False    S      10     M
2  False    M      11     M
3  False    M       1     M
4  False    M      20     M
5   True    L      14     L
6   True    S      12     L
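
As a possible alternative to the merge, the per-group Series can be mapped back onto the adult column, which leaves the original index of df untouched. This is only a sketch: the name size_by_group is made up for the illustration, and the explicit [['size', 'weight']] column selection is my own addition so the helper only sees the columns it needs.

```python
import pandas as pd

df = pd.DataFrame({'size': list('SSMMMLS'),
                   'weight': [8, 10, 11, 1, 20, 14, 12],
                   'adult': [False] * 5 + [True] * 2})


def size4max_weight(subf):
    """Return the size value of the row with the max weight."""
    return subf['size'][subf['weight'].idxmax()]


# One value per group, indexed by the group key (False / True).
# size_by_group is a hypothetical name used only in this sketch.
size_by_group = df.groupby('adult')[['size', 'weight']].apply(size4max_weight)

# Look up each row's group key in that Series.
df['size2'] = df['adult'].map(size_by_group)
print(df)
```

Since map does a lookup per row instead of a join, the row order and index of df stay as they were.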

answered Oct 18 '22 20:10

user3313834

You could use transform with loc and values:

>>> df["size2"] = df["size"].loc[df.groupby("adult")["weight"].transform("idxmax")].values
>>> df
   adult size  weight size2
0  False    S       8     M
1  False    S      10     M
2  False    M      11     M
3  False    M       1     M
4  False    M      20     M
5   True    L      14     L
6   True    S      12     L

Step by step, first we find the appropriate indices:

>>> df.groupby("adult")["weight"].transform("idxmax")
0    4
1    4
2    4
3    4
4    4
5    5
6    5
dtype: int64

Then we use these to index into the size column with loc:

>>> df["size"].loc[df.groupby("adult")["weight"].transform("idxmax")]
4    M
4    M
4    M
4    M
4    M
5    L
5    L
Name: size, dtype: object

And finally we take .values so that the indices don't get in the way when we try to assign:

>>> df["size"].loc[df.groupby("adult")["weight"].transform("idxmax")].values
array(['M', 'M', 'M', 'M', 'M', 'L', 'L'], dtype=object)

>>> df["size2"] = df["size"].loc[df.groupby("adult")["weight"].transform("idxmax")].values

>>> df
   adult size  weight size2
0  False    S       8     M
1  False    S      10     M
2  False    M      11     M
3  False    M       1     M
4  False    M      20     M
5   True    L      14     L
6   True    S      12     L
>>>
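
One thing worth knowing about this approach: idxmax returns the label of the first occurrence of the maximum, so when two rows of a group tie on weight, the earlier one wins. A small sketch with a made-up frame (tied is not part of the question's data) to illustrate, using to_numpy() in place of .values:

```python
import pandas as pd

# Hypothetical data: rows 1 and 2 tie for the max weight in the single group.
tied = pd.DataFrame({'size': list('SML'),
                     'weight': [5, 9, 9],
                     'adult': [True] * 3})

# Label of the first max-weight row, broadcast to every row of the group.
idx = tied.groupby('adult')['weight'].transform('idxmax')

# Drop the index with to_numpy() so the assignment is positional.
tied['size2'] = tied['size'].loc[idx].to_numpy()
print(tied)
```

If a different tie-breaking rule is needed, sorting the frame first would be one way to control which row comes first.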

answered Oct 18 '22 20:10

DSM

IIUC you can use merge. I think the first value in size2 should be M, because the max weight is 20.

df = pd.DataFrame({'size': list('SSMMMLS'),
                   'weight': [8, 10, 11, 1, 20, 14, 12],
                   'adult' : [False] * 5 + [True] * 2})

print(df)
   adult size  weight
0  False    S       8
1  False    S      10
2  False    M      11
3  False    M       1
4  False    M      20
5   True    L      14
6   True    S      12

print(
    df.groupby('adult')
      .apply(lambda subf: subf['size'][subf['weight'].idxmax()])
      .reset_index(name='size2')
)
   adult size2
0  False     M
1   True     L

print(
    pd.merge(df,
             df.groupby('adult')
               .apply(lambda subf: subf['size'][subf['weight'].idxmax()])
               .reset_index(name='size2'),
             on=['adult'])
)
   adult size  weight size2
0  False    S       8     M
1  False    S      10     M
2  False    M      11     M
3  False    M       1     M
4  False    M      20     M
5   True    L      14     L
6   True    S      12     L
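
Since the real frame has 500k rows, it may be worth avoiding the Python-level lambda. A rough sketch of the same merge idea that gets the max-weight rows from idxmax directly; the names heaviest and merged are only for illustration, and how='left' is my choice to keep the rows of df in their original order:

```python
import pandas as pd

df = pd.DataFrame({'size': list('SSMMMLS'),
                   'weight': [8, 10, 11, 1, 20, 14, 12],
                   'adult': [False] * 5 + [True] * 2})

# Index labels of the max-weight row in each 'adult' group.
heaviest = df.groupby('adult')['weight'].idxmax()

# Keep only the group key and the size of those rows, renamed to size2.
size2_col = (df.loc[heaviest.to_numpy(), ['adult', 'size']]
               .rename(columns={'size': 'size2'}))

# Broadcast the per-group value back to every row of the group.
merged = pd.merge(df, size2_col, on='adult', how='left')
print(merged)
```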

answered Oct 18 '22 21:10

jezrael
