pandas - find first occurrence

Tags: pandas

Suppose I have a structured dataframe as follows:

df = pd.DataFrame({"A": ['a','a','a','b','b'], "B": [1]*5})

Column A has already been sorted. I want to find the index of the first row where df.A != 'a'. The end goal is to use this index to split the data frame into groups based on A.

I realise that groupby exists. However, the real dataframe is quite large and this is a simplified toy example. Since A is already sorted, it should be faster to just find the first index where df.A != 'a'. It is therefore important that whatever method you use stops scanning once the first such element is found.

asked Dec 21 '16 04:12

sachinruk

2 Answers

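idxmax and argmax return the location of the maximal value, or the first such location if the maximal value occurs more than once: idxmax gives the index label, argmax the integer position. Applied to a boolean mask, where True is the maximum, they locate the first True.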

Use idxmax on df.A.ne('a'):

df.A.ne('a').idxmax()
3

Or the numpy equivalent:

(df.A.values != 'a').argmax()
3
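Two caveats are worth checking before relying on these calls. The sketch below uses a hypothetical non-default index (10 to 14) to show that idxmax returns a label while argmax returns a position, and a hypothetical all-'a' series to show what happens when nothing matches. It is an illustration, not part of the original answer.

```python
import pandas as pd

# Same toy data as the question, but with a hypothetical non-default index.
df = pd.DataFrame({"A": ['a', 'a', 'a', 'b', 'b'], "B": [1] * 5},
                  index=[10, 11, 12, 13, 14])

mask = df.A.ne('a')
label = mask.idxmax()                # index label of the first True
position = mask.to_numpy().argmax()  # integer position of the first True

# If no element differs from 'a', the mask is all False and idxmax/argmax
# still return the first entry, so test mask.any() before trusting the result.
all_a = pd.Series(['a', 'a', 'a'])   # hypothetical series with no mismatch
empty_mask = all_a.ne('a')
misleading = empty_mask.idxmax()
found = bool(empty_mask.any())
```

Use label with df.loc and position with df.iloc. When found is False, treat the result as "no such row" rather than as row 0.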

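However, if A has already been sorted, then we can use searchsorted:

df.A.searchsorted('a', side='right')
array([3])

(pandas 0.24 and later return the scalar 3 for a scalar argument; older versions return a one-element array.)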

Or the numpy equivalent:

df.A.values.searchsorted('a', side='right')
3
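As a minimal sketch of the stated end goal, the position returned by searchsorted can be passed straight to iloc to cut the frame in two. This assumes A is sorted, as in the question. The names split, first_group and rest are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"A": ['a', 'a', 'a', 'b', 'b'], "B": [1] * 5})

# Binary search on the sorted column: position just past the last 'a'.
split = df.A.to_numpy().searchsorted('a', side='right')

first_group = df.iloc[:split]   # rows where A == 'a'
rest = df.iloc[split:]          # everything after the 'a' block
```

Because searchsorted is a binary search, it inspects only about log2(n) elements, but it gives a meaningful answer only when the column really is sorted.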

answered Oct 14 '22 21:10

piRSquared

I found that pandas DataFrames have a first_valid_index method that will do the job. It can be used as follows:

df[df.A!='a'].first_valid_index()
3

However, this function seems to be very slow. Even taking the first index of the filtered dataframe is faster:

df.loc[df.A!='a','A'].index[0]

Below I compare the total time (in seconds) of repeating the calculation 100 times for these two options and all the code above:

                           total_time_sec  ratio wrt fastest algo
searchsorted numpy:                0.0007                    1.00
argmax numpy:                      0.0009                    1.29
for loop:                          0.0045                    6.43
searchsorted pandas:               0.0075                   10.71
idxmax pandas:                     0.0267                   38.14
index[0]:                          0.0295                   42.14
first_valid_index pandas:          0.1181                  168.71

Notice that numpy's searchsorted is the winner and first_valid_index shows the worst performance. In general the numpy versions are faster than their pandas counterparts. The for loop does not do badly here, but only because the dataframe has very few entries.

For a dataframe with 10,000 entries where the desired entries are close to the end, the ranking changes: numpy's searchsorted is still the fastest, while the for loop becomes by far the slowest:

                           total_time_sec  ratio wrt fastest algo
searchsorted numpy:                0.0007                    1.00
searchsorted pandas:               0.0076                   10.86
argmax numpy:                      0.0117                   16.71
index[0]:                          0.0815                  116.43
idxmax pandas:                     0.0904                  129.14
first_valid_index pandas:          0.1691                  241.57
for loop:                          9.6504                13786.29

The code to produce these results is below:

import timeit

import numpy as np
import pandas as pd

# code snippet to be executed only once
mysetup = '''import pandas as pd
import numpy as np
df = pd.DataFrame({"A":['a','a','a','b','b'],"B":[1]*5})
'''

# code snippets whose execution time is to be measured
mycode_set = ['''
df[df.A!='a'].first_valid_index()
''']
message = ["first_valid_index pandas:"]

mycode_set.append('''df.loc[df.A!='a','A'].index[0]''')
message.append("index[0]: ")

mycode_set.append('''df.A.ne('a').idxmax()''')
message.append("idxmax pandas: ")

mycode_set.append('''(df.A.values != 'a').argmax()''')
message.append("argmax numpy: ")

mycode_set.append('''df.A.searchsorted('a', side='right')''')
message.append("searchsorted pandas: ")

mycode_set.append('''df.A.values.searchsorted('a', side='right')''')
message.append("searchsorted numpy: ")

mycode_set.append('''for index in range(len(df['A'])):
    if df['A'][index] != 'a':
        ans = index
        break
''')
message.append("for loop: ")

# time each snippet: 100 repetitions, rounded to 4 decimals
total_time_in_sec = []
for mycode in mycode_set:
    total_time_in_sec.append(
        np.round(timeit.timeit(setup=mysetup, stmt=mycode, number=100), 4))

output = pd.DataFrame(total_time_in_sec, index=message,
                      columns=['total_time_sec'])
output["ratio wrt fastest algo"] = np.round(
    output.total_time_sec / output["total_time_sec"].min(), 2)

output = output.sort_values(by="total_time_sec")
print(output)  # or display(output) in a Jupyter notebook

For the larger dataframe:

mysetup = '''import pandas as pd
import numpy as np
n = 10000
lt = ['a' for _ in range(n)]
b = ['b' for _ in range(5)]
lt[-5:] = b
df = pd.DataFrame({"A":lt,"B":[1]*n})
'''
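The question asks for a scan that stops at the first mismatch. Of the snippets timed above, only the for loop does that, and it pays Python-level overhead per element, while df.A.values != 'a' compares every element before argmax runs. One possible middle ground, sketched here under the assumption that a chunk size of 1024 is reasonable, is to compare one chunk at a time and return as soon as a chunk contains a mismatch. The helper first_mismatch_position is hypothetical and has not been benchmarked against the methods above.

```python
import pandas as pd

def first_mismatch_position(values, target, chunk_size=1024):
    """Return the position of the first element != target, or -1 if none.

    Hypothetical helper: compares one chunk at a time and returns as soon
    as a chunk contains a mismatch, so later chunks are never examined.
    """
    for start in range(0, len(values), chunk_size):
        mask = values[start:start + chunk_size] != target
        if mask.any():
            return start + int(mask.argmax())
    return -1

# Same shape of data as the larger benchmark: 'b' only in the last 5 rows.
n = 10000
lt = ['a'] * n
lt[-5:] = ['b'] * 5
df = pd.DataFrame({"A": lt, "B": [1] * n})

pos = first_mismatch_position(df.A.to_numpy(), 'a')
```

Unlike searchsorted, this does not require A to be sorted, and unlike argmax on a full mask it signals "not found" with -1 instead of returning 0.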

answered Oct 14 '22 21:10

Anna K.
