I am working with a dataframe where I have weighted each row by its probability. Now I want to select the row with the highest probability, and I am using pandas' idxmax() to do so. However, when there are ties, it just returns the first of the tied rows. In my case, I want to get all the rows that tie.
Furthermore, I am doing this as part of a research project where I am processing millions of dataframes like the one below, so keeping it fast is an issue.
Example:
My data looks like this:
data = [['chr1', 100, 200, 0.2],
        ['ch1', 300, 500, 0.3],
        ['chr1', 300, 500, 0.3],
        ['chr1', 600, 800, 0.3]]
From this list, I create a pandas dataframe as follows:
weighted = pd.DataFrame.from_records(data, columns=['chrom', 'start', 'end', 'probability'])
Which looks like this:
chrom start end probability
0 chr1 100 200 0.2
1 ch1 300 500 0.3
2 chr1 300 500 0.3
3 chr1 600 800 0.3
Then I select the row matching argmax(probability) using (.loc here rather than the deprecated .ix, which was removed in pandas 1.0):
selected = weighted.loc[weighted['probability'].idxmax()]
Which of course returns:
chrom ch1
start 300
end 500
probability 0.3
Name: 1, dtype: object
Is there a (fast) way to get all the rows when there are ties?
Thanks!
The bottleneck lies in calculating the Boolean indexer. You can bypass the overhead associated with pd.Series
objects by performing calculations with the underlying NumPy array:
df2 = df[df['probability'].values == df['probability'].values.max()]
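Applied to the sample data from the question, this one-liner keeps every tied row rather than just the first. A minimal, self-contained sketch:

```python
import pandas as pd

data = [['chr1', 100, 200, 0.2],
        ['ch1', 300, 500, 0.3],
        ['chr1', 300, 500, 0.3],
        ['chr1', 600, 800, 0.3]]
weighted = pd.DataFrame.from_records(
    data, columns=['chrom', 'start', 'end', 'probability'])

# Compute the Boolean mask on the underlying NumPy array, then index the frame
mask = weighted['probability'].values == weighted['probability'].values.max()
selected = weighted[mask]
print(selected)  # rows 1, 2 and 3, all with probability 0.3
```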
Performance benchmarking with the Pandas equivalent:
# tested on Pandas v0.19.2, Python 3.6.0
df = pd.concat([df]*100000, ignore_index=True)
%timeit df['probability'].eq(df['probability'].max()) # 3.78 ms per loop
%timeit df['probability'].values == df['probability'].values.max() # 416 µs per loop
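On pandas 0.24 and later, .to_numpy() is the documented accessor for the underlying array; the same trick with that accessor looks like this (a sketch, equivalent in behavior to the .values version above):

```python
import pandas as pd

df = pd.DataFrame({'probability': [0.2, 0.3, 0.3, 0.3]})

# .to_numpy() hands back the raw NumPy array, avoiding pd.Series overhead
arr = df['probability'].to_numpy()
ties = df[arr == arr.max()]  # all rows tied at the maximum
```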