Keep other columns when doing groupby

I'm using groupby on a pandas dataframe to drop all rows that don't have the minimum of a specific column. Something like this:

df1 = df.groupby("item", as_index=False)["diff"].min()

However, if I have more than those two columns, the other columns (e.g. otherstuff in my example) get dropped. Can I keep those columns using groupby, or am I going to have to find a different way to drop the rows?

My data looks like:

    item    diff   otherstuff
   0   1       2            1
   1   1       1            2
   2   1       3            7
   3   2      -1            0
   4   2       1            3
   5   2       4            9
   6   2      -6            2
   7   3       0            0
   8   3       2            9

and should end up like:

    item   diff  otherstuff
   0   1      1           2
   1   2     -6           2
   2   3      0           0

but what I'm getting is:

    item   diff
   0   1      1           
   1   2     -6           
   2   3      0                 
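For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the frame from the table above and runs the same aggregation. The values are copied from the question; nothing else is assumed.

```python
import pandas as pd

# Sample frame copied from the table in the question.
df = pd.DataFrame({
    "item":       [1, 1, 1, 2, 2, 2, 2, 3, 3],
    "diff":       [2, 1, 3, -1, 1, 4, -6, 0, 2],
    "otherstuff": [1, 2, 7, 0, 3, 9, 2, 0, 9],
})

# Selecting ["diff"] before aggregating means only the group key and
# "diff" are carried into the result; "otherstuff" is never looked at.
df1 = df.groupby("item", as_index=False)["diff"].min()
print(df1.columns.tolist())  # ['item', 'diff']
```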

I've been looking through the documentation and can't find anything. I tried:

df1 = df.groupby(["item", "otherstuff"], as_index=False)["diff"].min()

df1 = df.groupby("item", as_index=False)["diff"].min()["otherstuff"]

df1 = df.groupby("item", as_index=False)["otherstuff", "diff"].min()

But none of those work (I realized with the last one that the syntax is meant for aggregating after a group is created).

PointXIV asked Apr 30 '14 17:04



4 Answers

Method #1: use idxmin() to get the indices of the elements of minimum diff, and then select those:

>>> df.loc[df.groupby("item")["diff"].idxmin()]
   item  diff  otherstuff
1     1     1           2
6     2    -6           2
7     3     0           0

[3 rows x 3 columns]

Method #2: sort by diff, and then take the first element in each item group:

>>> df.sort_values("diff").groupby("item", as_index=False).first()
   item  diff  otherstuff
0     1     1           2
1     2    -6           2
2     3     0           0

[3 rows x 3 columns]

Note that the resulting indices are different even though the row content is the same.
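To make the index difference concrete, here is a small self-contained sketch that runs both methods on the question's data. The variable names by_idxmin and by_first are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "item":       [1, 1, 1, 2, 2, 2, 2, 3, 3],
    "diff":       [2, 1, 3, -1, 1, 4, -6, 0, 2],
    "otherstuff": [1, 2, 7, 0, 3, 9, 2, 0, 9],
})

# Method #1: idxmin() returns the original row label of each group's minimum,
# so the selected rows keep the labels they had in df.
by_idxmin = df.loc[df.groupby("item")["diff"].idxmin()]

# Method #2: sort, then take the first row of each group; as_index=False
# gives the result a fresh 0..n-1 index.
by_first = df.sort_values("diff").groupby("item", as_index=False).first()

print(by_idxmin.index.tolist())  # [1, 6, 7]
print(by_first.index.tolist())   # [0, 1, 2]

# Dropping the old labels makes the two results directly comparable.
same_rows = (
    by_idxmin.reset_index(drop=True).to_dict("records")
    == by_first.to_dict("records")
)
print(same_rows)
```

One caveat: GroupBy.first() takes the first non-null value of each column separately, so if the data contains NaN, Method #2 can combine values from different rows, while Method #1 always returns whole rows.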

DSM answered Oct 16 '22 09:10


You can use DataFrame.sort_values with DataFrame.drop_duplicates:

df = df.sort_values(by='diff').drop_duplicates(subset='item')
print (df)
   item  diff  otherstuff
6     2    -6           2
7     3     0           0
1     1     1           2

If a group can contain several rows tied for the minimum and you want all of them, use boolean indexing with transform, which broadcasts each group's minimum back onto every row of that group:

print (df)
   item  diff  otherstuff
0     1     2           1
1     1     1           2 <-multiple min
2     1     1           7 <-multiple min
3     2    -1           0
4     2     1           3
5     2     4           9
6     2    -6           2
7     3     0           0
8     3     2           9

print (df.groupby("item")["diff"].transform('min'))
0    1
1    1
2    1
3   -6
4   -6
5   -6
6   -6
7    0
8    0
Name: diff, dtype: int64

df = df[df.groupby("item")["diff"].transform('min') == df['diff']]
print (df)
   item  diff  otherstuff
1     1     1           2
2     1     1           7
6     2    -6           2
7     3     0           0
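
As a runnable comparison, the sketch below rebuilds the tied data and contrasts the transform mask with idxmin(), which returns only the first tied row per group. The names group_min, all_mins and one_min are illustrative.

```python
import pandas as pd

# Same data as in the question, but with a tie for the minimum in item 1.
df = pd.DataFrame({
    "item":       [1, 1, 1, 2, 2, 2, 2, 3, 3],
    "diff":       [2, 1, 1, -1, 1, 4, -6, 0, 2],
    "otherstuff": [1, 2, 7, 0, 3, 9, 2, 0, 9],
})

# transform("min") returns a Series aligned with df, holding each row's group minimum.
group_min = df.groupby("item")["diff"].transform("min")

# The boolean mask keeps every row that equals its group's minimum (ties included).
all_mins = df[df["diff"] == group_min]

# idxmin() gives one label per group: the first occurrence of the minimum.
one_min = df.loc[df.groupby("item")["diff"].idxmin()]

print(all_mins.index.tolist())  # [1, 2, 6, 7]
print(one_min.index.tolist())   # [1, 6, 7]
```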
jezrael answered Oct 16 '22 10:10


The answer above works well if there is, or you only want, one minimum per group. In my case there could be several rows tied for the minimum and I wanted all of them, which .idxmin() doesn't give you. This worked:

import pandas as pd

def filter_group(dfg, col):
    # Keep every row of the group whose value equals the group's minimum.
    return dfg[dfg[col] == dfg[col].min()]

df = pd.DataFrame({'g': ['a'] * 6 + ['b'] * 6, 'v1': (list(range(3)) + list(range(3))) * 2, 'v2': range(12)})
df.groupby('g', group_keys=False).apply(lambda x: filter_group(x, 'v1'))

As an aside, .filter() is also relevant to this question but didn't work for me.
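
A likely reason .filter() is a poor fit here: GroupBy.filter() keeps or drops each group as a whole, whereas this problem needs a per-row decision. The sketch below shows the difference on the same frame, using a transform-based mask for the row-level case. The filter condition (v2 maximum above 10) is an arbitrary example, not something from the answer.

```python
import pandas as pd

df = pd.DataFrame({
    "g": ["a"] * 6 + ["b"] * 6,
    "v1": (list(range(3)) + list(range(3))) * 2,
    "v2": range(12),
})

# Row-level selection: keep each row whose v1 equals its group's minimum.
min_rows = df[df["v1"] == df.groupby("g")["v1"].transform("min")]

# Group-level selection: the callable returns one True/False per group,
# and every row of a passing group is kept.
whole_groups = df.groupby("g").filter(lambda grp: grp["v2"].max() > 10)

print(min_rows.index.tolist())      # [0, 3, 6, 9]
print(whole_groups.index.tolist())  # [6, 7, 8, 9, 10, 11]
```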

citynorman answered Oct 16 '22 11:10


I tried everyone's method and I couldn't get it to work properly. Instead I did the process step-by-step and ended up with the correct result.

df.sort_values(by='diff', inplace=True, ignore_index=True)
df.drop_duplicates(subset='item', inplace=True, ignore_index=True)
df.sort_values(by='item', inplace=True, ignore_index=True)

For a little more explanation:

  1. Sort the rows by the column whose minimum you want (diff here)
  2. Drop duplicates on the grouping column (item), which keeps the first, i.e. smallest, row of each group
  3. Re-sort by the grouping column, because the data is still ordered by the minimum values
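
The three steps can be sketched without inplace=True so each intermediate result is visible. This assumes diff is the column to minimise and item is the grouping column, as in the question; step1, step2 and result are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "item":       [1, 1, 1, 2, 2, 2, 2, 3, 3],
    "diff":       [2, 1, 3, -1, 1, 4, -6, 0, 2],
    "otherstuff": [1, 2, 7, 0, 3, 9, 2, 0, 9],
})

# 1. Sort so that the smallest diff of every item comes first.
step1 = df.sort_values(by="diff", ignore_index=True)

# 2. drop_duplicates keeps the first row it sees per item,
#    which after the sort is the row with the smallest diff.
step2 = step1.drop_duplicates(subset="item", ignore_index=True)

# 3. The rows are still ordered by diff, so restore ordering by item.
result = step2.sort_values(by="item", ignore_index=True)
print(result)
```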
Brad123 answered Oct 16 '22 10:10