 

Pandas - idxmin on multiple columns with keeping all ties

I have a DF that looks like this:

             Virus         Host  blastRank  crisprRank  mashRank
0      NC_000866|1  NC_017660|1        1.0         inf       inf
1      NC_000871|1  NC_017595|1        1.0         inf       inf
2      NC_000872|1  NC_017595|1        1.0         inf       inf
3      NC_000896|1  NC_008530|1        1.0         inf       inf
4      NC_000902|1  NC_011353|1        1.0         inf       inf
...            ...          ...        ...         ...       ...
51935  NC_024392|1  NC_021824|1        inf         inf       1.0
51936  NC_024392|1  NC_021829|1        inf         inf       1.0
51937  NC_024392|1  NC_021837|1        inf         inf       1.0
51938  NC_024392|1  NC_021872|1        inf         inf       1.0
51939  NC_024392|1  NC_022737|1        inf         inf       1.0

I would like to group this df by Virus and, for each group, keep the rows that hold the minimum of each column (one row where blastRank is minimal, one where crisprRank is minimal, and so on). If several rows tie for the minimum, I would like to keep all of them. The solution also has to work with more than these 3 numeric columns, which is why I select them with df[df.columns.to_list()[2:]].

This is my code and df that it produces:

df = df.groupby(['Virus'], as_index=False).apply(lambda x: x.loc[x[x.columns.to_list()[2:]].idxmin()].reset_index(drop=True))


             Virus         Host  blastRank  crisprRank  mashRank
0   0  NC_000866|1  NC_017660|1        1.0         inf       inf
    1  NC_000866|1  NC_017660|1        1.0         inf       inf
    2  NC_000866|1  NC_002163|1        inf         inf       1.0
1   0  NC_000871|1  NC_017595|1        1.0         inf       inf
    1  NC_000871|1  NC_006449|1        inf         1.0       1.0
...            ...          ...        ...         ...       ...
818 1  NC_024391|1  NC_009641|1        1.0         inf       inf
    2  NC_024391|1  NC_003103|1        inf         inf       1.0
819 0  NC_024392|1  NC_021823|1        1.0         1.0       inf
    1  NC_024392|1  NC_021823|1        1.0         1.0       inf
    2  NC_024392|1  NC_003212|1        inf         inf       1.0

As you can see, idxmin() returns only the first row that holds the minimum. I would like to do something like idxmin(keep='all') to get all the ties.
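A minimal illustration of the behavior described above, on a made-up Series (the names s, first_only and all_ties are only for this sketch): idxmin() reports the first label of the minimum, while comparing against min() keeps every tied label.

```python
import pandas as pd

# Toy data: the minimum 1.0 occurs twice, at labels 1 and 2
s = pd.Series([3.0, 1.0, 1.0, 2.0])

# idxmin() returns only the first label where the minimum occurs
first_only = s.idxmin()

# Comparing against the minimum keeps every tied label
all_ties = s.index[s == s.min()].tolist()

print(first_only)  # 1
print(all_ties)    # [1, 2]
```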

777moneymaker asked Nov 22 '25 19:11


2 Answers

I think you need to compare against the minimal value per group to keep all ties:

cols = df.columns.to_list()[2:]

f = lambda x: x.apply(lambda x: x[x == x.min()].reset_index(drop=True))
df = df.groupby(['Virus'])[cols].apply(f)

If you need all values in the original order:

cols = df.columns.to_list()[2:]

f = lambda x: x[cols].where(x[cols].eq(x[cols].min()))
df[cols] = df.groupby(['Virus'], group_keys=False).apply(f)
df = df.dropna(subset=cols, how='all')
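As a self-contained sketch of the same idea, the per-group minimum can also be broadcast with transform('min') and compared against the original columns. The toy frame and the names toy, rank_cols, group_min, is_min and ties are made up for illustration. Unlike the snippet above, this variant keeps the original values of the surviving rows instead of masking the non-minimal ones with NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two viruses, two rank columns, with ties
toy = pd.DataFrame({
    "Virus": ["A", "A", "A", "A", "B", "B", "B"],
    "Host": ["h1", "h2", "h3", "h4", "h5", "h6", "h7"],
    "blastRank": [1.0, 1.0, np.inf, 5.0, 2.0, 1.0, 3.0],
    "crisprRank": [np.inf, 3.0, 2.0, 4.0, 1.0, 1.0, 2.0],
})

# Every column after Virus and Host is treated as a rank column
rank_cols = toy.columns.to_list()[2:]

# Per-group minimum of each rank column, aligned to the original rows
group_min = toy.groupby("Virus")[rank_cols].transform("min")

# True where a value equals its group's minimum (ties included)
is_min = toy[rank_cols].eq(group_min)

# Keep rows that are minimal in at least one rank column
ties = toy[is_min.any(axis=1)]
print(ties)
```

One caveat under this approach: if every value of a column within a group is inf, the group minimum is inf as well, so all rows of that group tie on that column and are kept.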

Or:

df = df.melt(['Virus','Host'])
df1 = df[df.groupby(['Virus','variable'])['value'].transform('min').eq(df['value'])].copy()
df1 = df1.pivot(index=['Virus','Host'], columns='variable', values='value')

print (df1)
jezrael answered Nov 24 '25 08:11


Here's one way to solve it:

import numpy as np
import pandas as pd
from io import StringIO

data = StringIO("""
             Virus         Host  blastRank  crisprRank  mashRank
0      NC_000866|1  NC_017660|1        1.0         5       8
1      NC_000866|1  NC_017595|1        2.0         4       5
2      NC_000872|1  NC_017595|1        3.0         3       10
3      NC_000872|1  NC_008530|1        4.0         0       3
4      NC_000872|1  NC_011353|1        5.0         1       -3
""")
df = pd.read_csv(data, sep=r'\s+').convert_dtypes()

cols_of_interest = [c for c in df.columns if c not in ['Virus', 'Host']]

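def get_all_min(sdf):
    # One-row frame holding the minimum of every column in this group
    sdf_min = sdf.min().to_frame().T
    # For each rank column, an inner merge on that column keeps every row equal to its minimum (ties included)
    result = pd.concat([pd.merge(sdf, sdf_min[[c]], how='inner') for c in sdf_min.columns if c in cols_of_interest])
    # A row that is minimal in several columns appears once per column, so deduplicate
    result = result.drop_duplicates().reset_index(drop=True)
    return result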

df.groupby('Virus', as_index=False).apply(get_all_min).reset_index(drop=True)
SultanOrazbayev answered Nov 24 '25 08:11


