I am working with a dataframe where I have weighted each row by its probability. Now I want to select the row with the highest probability, and I am using pandas' idxmax() to do so. However, when there are ties, it just returns the first of the tied rows. In my case, I want to get all the rows that tie.
Furthermore, I am doing this as part of a research project where I am processing millions of dataframes like the one below, so keeping it fast is an issue.
Example:
My data looks like this:
data = [['chr1', 100, 200, 0.2],
        ['ch1', 300, 500, 0.3],
        ['chr1', 300, 500, 0.3],
        ['chr1', 600, 800, 0.3]]
From this list, I create a pandas dataframe as follows:
weighted = pd.DataFrame.from_records(data, columns=['chrom', 'start', 'end', 'probability'])
Which looks like this:
chrom start end probability
0 chr1 100 200 0.2
1 ch1 300 500 0.3
2 chr1 300 500 0.3
3 chr1 600 800 0.3
Then I select the row matching argmax(probability) using (.loc here rather than the deprecated .ix, which was removed in pandas 1.0):
selected = weighted.loc[weighted['probability'].idxmax()]
Which of course returns:
chrom ch1
start 300
end 500
probability 0.3
Name: 1, dtype: object
Is there a (fast) way to get all the rows when there are ties?
Thanks!
The bottleneck lies in calculating the Boolean indexer. You can bypass the overhead associated with pd.Series
objects by performing calculations with the underlying NumPy array:
df2 = df[df['probability'].values == df['probability'].values.max()]
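Applied to the sample data from the question, this one-liner keeps every tied row rather than just the first. A minimal, self-contained sketch:

```python
import pandas as pd

data = [['chr1', 100, 200, 0.2],
        ['ch1', 300, 500, 0.3],
        ['chr1', 300, 500, 0.3],
        ['chr1', 600, 800, 0.3]]
weighted = pd.DataFrame.from_records(
    data, columns=['chrom', 'start', 'end', 'probability'])

# Compute the Boolean mask on the underlying NumPy array, then index the frame
mask = weighted['probability'].values == weighted['probability'].values.max()
selected = weighted[mask]
print(selected)  # rows 1, 2 and 3, all with probability 0.3
```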
Performance benchmarking with the Pandas equivalent:
# tested on Pandas v0.19.2, Python 3.6.0
df = pd.concat([df]*100000, ignore_index=True)
%timeit df['probability'].eq(df['probability'].max()) # 3.78 ms per loop
%timeit df['probability'].values == df['probability'].values.max() # 416 µs per loop
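On pandas 0.24 and later, .to_numpy() is the documented accessor for the underlying array; the same trick with that accessor looks like this (a sketch, equivalent in behavior to the .values version above):

```python
import pandas as pd

df = pd.DataFrame({'probability': [0.2, 0.3, 0.3, 0.3]})

# .to_numpy() hands back the raw NumPy array, avoiding pd.Series overhead
arr = df['probability'].to_numpy()
ties = df[arr == arr.max()]  # all rows tied at the maximum
```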