
pandas idxmax: return all rows in case of ties

I am working with a dataframe in which I have weighted each row by its probability. I want to select the row with the highest probability, and I am using pandas idxmax() to do so. However, when there are ties, it returns only the first of the tied rows. In my case, I want to get all the rows that tie.

Furthermore, I am doing this as part of a research project where I am processing millions of dataframes like the one below, so speed matters.

Example:

My data looks like this:

data = [['chr1',100,200,0.2],
    ['ch1',300,500,0.3],
    ['chr1', 300, 500, 0.3],
    ['chr1', 600, 800, 0.3]]

From this list, I create a pandas dataframe as follows:

weighted = pd.DataFrame.from_records(data,columns=['chrom','start','end','probability'])

Which looks like this:

  chrom  start  end  probability
0  chr1    100  200          0.2
1   ch1    300  500          0.3
2  chr1    300  500          0.3
3  chr1    600  800          0.3

Then I select the row that fits argmax(probability) using:

selected = weighted.loc[weighted['probability'].idxmax()]

Which of course returns:

chrom          ch1
start          300
end            500
probability    0.3
Name: 1, dtype: object

Is there a (fast) way to get all the rows when there are ties?

thanks!

Praderas asked Oct 01 '18 09:10


1 Answer

The bottleneck lies in calculating the Boolean indexer. You can bypass the overhead associated with pd.Series objects by performing calculations with the underlying NumPy array:

df2 = df[df['probability'].values == df['probability'].values.max()]
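Applied to the question's four-row example, the comparison above can be sketched as follows. The names probs, mask, and tied are illustrative and do not appear in the original post. Note that an exact == comparison on floats only matches ties that are bit-for-bit equal, which holds here because the tied values are identical literals.

```python
import pandas as pd

data = [['chr1', 100, 200, 0.2],
        ['ch1', 300, 500, 0.3],
        ['chr1', 300, 500, 0.3],
        ['chr1', 600, 800, 0.3]]
weighted = pd.DataFrame.from_records(
    data, columns=['chrom', 'start', 'end', 'probability'])

# Work on the underlying NumPy array to skip pd.Series overhead.
probs = weighted['probability'].values
mask = probs == probs.max()   # True for every row that ties for the maximum
tied = weighted[mask]         # keeps the original index labels
print(tied)
```

Here tied holds the rows labelled 1, 2, and 3, rather than only row 1 as with idxmax().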

Performance benchmarking with the Pandas equivalent:

# tested on Pandas v0.19.2, Python 3.6.0

df = pd.concat([df]*100000, ignore_index=True)

%timeit df['probability'].eq(df['probability'].max())               # 3.78 ms per loop
%timeit df['probability'].values == df['probability'].values.max()  # 416 µs per loop
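%timeit is an IPython magic, so the lines above only run in IPython or Jupyter. Outside of those, a comparable measurement could be sketched with the standard timeit module. The helper names pandas_mask and numpy_mask, the frame size, and the repetition count are all assumptions made for illustration; timings depend on the machine and library versions, so none are quoted.

```python
import timeit

import pandas as pd

# Rebuild a larger frame from the question's probabilities (size is arbitrary).
df = pd.DataFrame({'probability': [0.2, 0.3, 0.3, 0.3] * 10_000})

def pandas_mask():
    # Boolean indexer computed through the pd.Series API.
    return df['probability'].eq(df['probability'].max())

def numpy_mask():
    # Same indexer computed on the underlying NumPy array.
    values = df['probability'].values
    return values == values.max()

# Both approaches must select the same rows.
assert (pandas_mask().values == numpy_mask()).all()

print('pandas:', timeit.timeit(pandas_mask, number=100))
print('numpy: ', timeit.timeit(numpy_mask, number=100))
```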
jpp answered Sep 18 '22 21:09
