Keep other columns when doing groupby

Tags: python, pandas, aggregate, pandas-groupby

I'm using groupby on a pandas dataframe to drop all rows that don't have the minimum of a specific column. Something like this:

df1 = df.groupby("item", as_index=False)["diff"].min()

However, if I have more than those two columns, the other columns (e.g. otherstuff in my example) get dropped. Can I keep those columns using groupby, or am I going to have to find a different way to drop the rows?

My data looks like:

   item  diff  otherstuff
0     1     2           1
1     1     1           2
2     1     3           7
3     2    -1           0
4     2     1           3
5     2     4           9
6     2    -6           2
7     3     0           0
8     3     2           9

and should end up like:

   item  diff  otherstuff
0     1     1           2
1     2    -6           2
2     3     0           0

but what I'm getting is:

   item  diff
0     1     1
1     2    -6
2     3     0

I've been looking through the documentation and can't find anything. I tried:

df1 = df.groupby(["item", "otherstuff"], as_index=False)["diff"].min()

df1 = df.groupby("item", as_index=False)["diff"].min()["otherstuff"]

df1 = df.groupby("item", as_index=False)["otherstuff", "diff"].min()

But none of those work (I realized with the last one that the syntax is meant for aggregating after a group is created).
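For anyone who wants to reproduce the behaviour, the sample frame from the tables above can be rebuilt roughly as follows. This is only a minimal sketch; the variable name `mins` is an assumption, not something from the original post.

```python
import pandas as pd

# Sample data copied from the tables in the question.
df = pd.DataFrame({
    "item": [1, 1, 1, 2, 2, 2, 2, 3, 3],
    "diff": [2, 1, 3, -1, 1, 4, -6, 0, 2],
    "otherstuff": [1, 2, 7, 0, 3, 9, 2, 0, 9],
})

# Selecting ["diff"] before aggregating means only the grouping key and
# that one column are carried into the result, so "otherstuff" is absent.
mins = df.groupby("item", as_index=False)["diff"].min()
print(mins.columns.tolist())
```

The aggregation computes one value per group rather than picking a row, which is why there is no single `otherstuff` value it could carry along.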

asked Apr 30 '14 17:04

PointXIV

4 Answers

Method #1: use idxmin() to get the indices of the elements of minimum diff, and then select those:

>>> df.loc[df.groupby("item")["diff"].idxmin()]
   item  diff  otherstuff
1     1     1           2
6     2    -6           2
7     3     0           0

[3 rows x 3 columns]

Method #2: sort by diff, and then take the first element in each item group:

>>> df.sort_values("diff").groupby("item", as_index=False).first()
   item  diff  otherstuff
0     1     1           2
1     2    -6           2
2     3     0           0

[3 rows x 3 columns]

Note that the resulting indices are different even though the row content is the same.
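To make the index difference concrete, here is a small self-contained sketch that runs both methods on the question's sample data. The names `by_idxmin` and `by_first` are illustrative assumptions.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "item": [1, 1, 1, 2, 2, 2, 2, 3, 3],
    "diff": [2, 1, 3, -1, 1, 4, -6, 0, 2],
    "otherstuff": [1, 2, 7, 0, 3, 9, 2, 0, 9],
})

# Method #1: idxmin() gives, per group, the index label of the row
# holding the minimum; .loc then selects those original rows.
by_idxmin = df.loc[df.groupby("item")["diff"].idxmin()]

# Method #2: sort so the smallest diff comes first, then take the
# first row of each group; as_index=False produces a fresh 0..n-1 index.
by_first = df.sort_values("diff").groupby("item", as_index=False).first()

print(by_idxmin.index.tolist())  # original row labels are kept
print(by_first.index.tolist())   # labels are renumbered

# Dropping the original labels makes the two results line up.
print(by_idxmin.reset_index(drop=True))
```

One caveat worth checking against your own data: `first()` takes the first non-null value of each column separately, so with missing values Method #2 may combine values from different rows, whereas Method #1 always returns whole rows.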

answered Oct 16 '22 09:10

DSM

You can use DataFrame.sort_values with DataFrame.drop_duplicates:

df = df.sort_values(by='diff').drop_duplicates(subset='item')
print (df)
   item  diff  otherstuff
6     2    -6           2
7     3     0           0
1     1     1           2

If a group can contain several rows with the minimal value and you want all of them, use boolean indexing with transform, which returns the per-group minimum aligned to the original rows:

print (df)
   item  diff  otherstuff
0     1     2           1
1     1     1           2 <-multiple min
2     1     1           7 <-multiple min
3     2    -1           0
4     2     1           3
5     2     4           9
6     2    -6           2
7     3     0           0
8     3     2           9

print (df.groupby("item")["diff"].transform('min'))
0    1
1    1
2    1
3   -6
4   -6
5   -6
6   -6
7    0
8    0
Name: diff, dtype: int64

df = df[df.groupby("item")["diff"].transform('min') == df['diff']]
print (df)
   item  diff  otherstuff
1     1     1           2
2     1     1           7
6     2    -6           2
7     3     0           0
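A runnable version of the tie case, contrasted with idxmin(), might look like the sketch below. The variable names are assumptions chosen for illustration.

```python
import pandas as pd

# Same data as above, with two rows tied for the minimum in item 1.
df = pd.DataFrame({
    "item": [1, 1, 1, 2, 2, 2, 2, 3, 3],
    "diff": [2, 1, 1, -1, 1, 4, -6, 0, 2],
    "otherstuff": [1, 2, 7, 0, 3, 9, 2, 0, 9],
})

# transform("min") broadcasts each group's minimum back to every row,
# so the comparison yields a mask with the same index as df.
group_min = df.groupby("item")["diff"].transform("min")
all_min_rows = df[df["diff"] == group_min]

# idxmin() reports only the first matching label per group,
# so one of the tied rows is lost.
one_per_group = df.loc[df.groupby("item")["diff"].idxmin()]

print(all_min_rows.index.tolist())
print(one_per_group.index.tolist())
```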

answered Oct 16 '22 10:10

jezrael

The above answer works well if there is only one minimum per group, or if you only want one of them. In my case there could be several minima and I wanted every row equal to the minimum, which .idxmin() doesn't give you. This worked:

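import pandas as pd

def filter_group(dfg, col):
    # Keep every row of this group whose value equals the group's minimum.
    return dfg[dfg[col] == dfg[col].min()]

df = pd.DataFrame({'g': ['a'] * 6 + ['b'] * 6, 'v1': (list(range(3)) + list(range(3))) * 2, 'v2': range(12)})
# group_keys=False keeps the original row index instead of adding 'g' as an index level.
df.groupby('g', group_keys=False).apply(lambda x: filter_group(x, 'v1'))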

As an aside, .filter() is also relevant to this question but didn't work for me.
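As a possible alternative that avoids apply altogether, the same rows can be selected with a transform mask. This is a sketch on the example frame above, not the answerer's own method; the names `mask` and `result` are assumptions.

```python
import pandas as pd

# Example frame from this answer: two groups, each with two rows at the minimum of v1.
df = pd.DataFrame({
    "g": ["a"] * 6 + ["b"] * 6,
    "v1": (list(range(3)) + list(range(3))) * 2,
    "v2": range(12),
})

# Compare each row's v1 with its group's minimum; True marks rows to keep.
mask = df["v1"] == df.groupby("g")["v1"].transform("min")
result = df[mask]

print(result)
```

Because this never passes the grouping column through `apply`, it should be less sensitive to how different pandas versions treat that column, though that is worth verifying on your installed version. As for `.filter()`, it keeps or drops entire groups based on a per-group condition, which is probably why it does not fit a row-level selection like this one.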

answered Oct 16 '22 11:10

citynorman

I tried everyone's method and I couldn't get it to work properly. Instead I did the process step-by-step and ended up with the correct result.

df.sort_values(by='diff', inplace=True, ignore_index=True)
df.drop_duplicates(subset='item', inplace=True, ignore_index=True)
df.sort_values(by='item', inplace=True, ignore_index=True)

For a little more explanation:

Sort the rows by the column whose minimum you want (diff), so each item's smallest value comes first
Drop duplicates on the grouping column (item), which keeps only the first, i.e. minimal, row of each group
Re-sort by the grouping column, because the data is still ordered by the minimum values
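The three steps above can be sketched as a small helper that returns a new frame instead of modifying `df` in place. It assumes, as in the question, that `item` is the grouping column and `diff` the column to minimise; the function name `keep_min_rows` is hypothetical.

```python
import pandas as pd

def keep_min_rows(frame, group_col, value_col):
    """Return one row per group: the one with the smallest value_col."""
    return (
        frame.sort_values(by=value_col)          # smallest values first
        .drop_duplicates(subset=group_col)       # keep first row per group
        .sort_values(by=group_col, ignore_index=True)  # restore group order
    )

# Sample data from the question.
df = pd.DataFrame({
    "item": [1, 1, 1, 2, 2, 2, 2, 3, 3],
    "diff": [2, 1, 3, -1, 1, 4, -6, 0, 2],
    "otherstuff": [1, 2, 7, 0, 3, 9, 2, 0, 9],
})

result = keep_min_rows(df, "item", "diff")
print(result)
```

If a group has tied minima, which of the tied rows survives depends on the sort; passing `kind="stable"` to the first `sort_values` should keep the earliest one, but treat that as something to confirm for your pandas version.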

answered Oct 16 '22 10:10

Brad123
