
pandas - find first occurrence

Tags: python, pandas

Suppose I have a structured dataframe as follows:

import pandas as pd

df = pd.DataFrame({"A": ['a', 'a', 'a', 'b', 'b'],
                   "B": [1] * 5})

The A column has already been sorted. I want to find the index of the first row where df.A != 'a'. The end goal is to use this index to split the dataframe into groups based on A.

I realise that groupby exists. However, the real dataframe is quite large, and this is only a simplified toy example. Since A is already sorted, it should be faster to find just the first index where df.A != 'a'. It is therefore important that, whatever method is used, the scan stops as soon as the first such element is found.

sachinruk asked Dec 21 '16 04:12



2 Answers

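idxmax and argmax return the location of the maximal value, or of its first occurrence if the maximum appears more than once. idxmax gives the index label and argmax gives the integer position. In a boolean mask, True is the maximum, so both locate the first True.

Use idxmax on df.A.ne('a'):

df.A.ne('a').idxmax()
3

or the numpy equivalent:

(df.A.values != 'a').argmax()
3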
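One caveat with the argmax approach: if no element differs from 'a', the mask is all False and argmax still returns 0. Here is a minimal sketch of a guard for that case. The helper name first_not_equal is made up for illustration and is not part of pandas or numpy.

```python
import pandas as pd

df = pd.DataFrame({"A": ['a', 'a', 'a', 'b', 'b'], "B": [1] * 5})


def first_not_equal(series, value):
    """Hypothetical helper: position of the first element != value, else None."""
    mask = series.to_numpy() != value
    # argmax returns 0 for an all-False mask, so check that a hit exists first
    if not mask.any():
        return None
    return int(mask.argmax())


print(first_not_equal(df.A, 'a'))            # 3
print(first_not_equal(df.A.iloc[:3], 'a'))   # None
```

Note that this scans the mask twice (once for any, once for argmax), so it trades a little speed for a safe answer.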

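However, if A has already been sorted, we can use searchsorted, which does a binary search instead of scanning the whole column:

df.A.searchsorted('a', side='right')
array([3])

(Older pandas versions return an array here; more recent ones return the scalar 3.)

Or the numpy equivalent:

df.A.values.searchsorted('a', side='right')
3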
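Since the stated goal is to break the dataframe apart at that position, here is a small sketch of how the searchsorted result could be used with iloc. The names split_pos, head and tail are only illustrative, and it assumes A is sorted, as in the question.

```python
import pandas as pd

df = pd.DataFrame({"A": ['a', 'a', 'a', 'b', 'b'], "B": [1] * 5})

# Position just past the last 'a' (binary search; requires A to be sorted)
split_pos = int(df["A"].to_numpy().searchsorted('a', side='right'))

# searchsorted returns a position, so slice with iloc rather than loc
head = df.iloc[:split_pos]   # rows where A == 'a'
tail = df.iloc[split_pos:]   # everything after the 'a' block

print(split_pos)             # 3
print(len(head), len(tail))  # 3 2
```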
piRSquared answered Oct 14 '22 21:10


I found that pandas has a first_valid_index function that will do the job. It can be used as follows:

df[df.A!='a'].first_valid_index()
3

However, this function seems to be very slow. Even taking the first index of the filtered dataframe is faster:

df.loc[df.A!='a','A'].index[0]
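One thing to keep in mind when mixing these approaches: first_valid_index, index[0] and idxmax return an index label, while argmax and searchsorted return an integer position. With the default 0..n-1 index the two coincide, but not otherwise. A small sketch with a made-up non-default index (10..14) to show the difference:

```python
import pandas as pd

# Same toy data, but with a hypothetical non-default index
df = pd.DataFrame({"A": ['a', 'a', 'a', 'b', 'b'], "B": [1] * 5},
                  index=[10, 11, 12, 13, 14])

label = df[df.A != 'a'].first_valid_index()          # index label
position = int((df.A.to_numpy() != 'a').argmax())    # integer position

print(label)     # 13
print(position)  # 3

# A label can be converted to a position with Index.get_loc
print(df.index.get_loc(label))  # 3
```

So use loc with the label-based results and iloc with the position-based ones.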

Below I compare the total time (in seconds) of 100 repetitions for each of these two options and for all the snippets above:

                           total_time_sec  ratio wrt fastest algo
searchsorted numpy:                0.0007                    1.00
argmax numpy:                      0.0009                    1.29
for loop:                          0.0045                    6.43
searchsorted pandas:               0.0075                   10.71
idxmax pandas:                     0.0267                   38.14
index[0]:                          0.0295                   42.14
first_valid_index pandas:          0.1181                  168.71

Note that numpy's searchsorted is the fastest and first_valid_index is the slowest. In general the numpy versions are faster than their pandas counterparts. The for loop does not do too badly here, but only because the dataframe has very few entries.

For a dataframe with 10,000 entries, where the desired entries are close to the end, the ranking changes: searchsorted still delivers the best performance, but the for loop becomes by far the slowest option:

                           total_time_sec  ratio wrt fastest algo
searchsorted numpy:                0.0007                    1.00
searchsorted pandas:               0.0076                   10.86
argmax numpy:                      0.0117                   16.71
index[0]:                          0.0815                  116.43
idxmax pandas:                     0.0904                  129.14
first_valid_index pandas:          0.1691                  241.57
for loop:                          9.6504                13786.29

The code to produce these results is below:

import timeit

import numpy as np
import pandas as pd

# setup code, executed once per timing run
mysetup = '''import pandas as pd
import numpy as np
df = pd.DataFrame({"A":['a','a','a','b','b'],"B":[1]*5})
'''

# code snippets whose execution time is to be measured
mycode_set = ['''
df[df.A!='a'].first_valid_index()
''']
message = ["first_valid_index pandas:"]

mycode_set.append('''df.loc[df.A!='a','A'].index[0]''')
message.append("index[0]: ")

mycode_set.append('''df.A.ne('a').idxmax()''')
message.append("idxmax pandas: ")

mycode_set.append('''(df.A.values != 'a').argmax()''')
message.append("argmax numpy: ")

mycode_set.append('''df.A.searchsorted('a', side='right')''')
message.append("searchsorted pandas: ")

mycode_set.append('''df.A.values.searchsorted('a', side='right')''')
message.append("searchsorted numpy: ")

mycode_set.append('''for index in range(len(df['A'])):
    if df['A'][index] != 'a':
        ans = index
        break
''')
message.append("for loop: ")

# time each snippet over 100 repetitions
total_time_in_sec = []
for mycode in mycode_set:
    total_time_in_sec.append(
        np.round(timeit.timeit(setup=mysetup, stmt=mycode, number=100), 4))

output = pd.DataFrame(total_time_in_sec, index=message,
                      columns=['total_time_sec'])
output["ratio wrt fastest algo"] = np.round(
    output.total_time_sec / output["total_time_sec"].min(), 2)

output = output.sort_values(by="total_time_sec")
print(output)

For the larger dataframe:

mysetup = '''import pandas as pd
import numpy as np
n = 10000
lt = ['a' for _ in range(n)]
b = ['b' for _ in range(5)]
lt[-5:] = b
df = pd.DataFrame({"A":lt,"B":[1]*n})
'''
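The question asks for a scan that stops at the first hit. The timed for loop does stop early, but it pays for a pandas label lookup (df['A'][index]) on every step. As a rough sketch, not something that was timed above, the same early-exit scan can run over the underlying numpy values with next and a generator. The name first_pos is just for illustration.

```python
import pandas as pd

n = 10000
df = pd.DataFrame({"A": ['a'] * (n - 5) + ['b'] * 5, "B": [1] * n})

values = df["A"].to_numpy()

# The generator stops at the first element != 'a'; the default None covers
# the case where every element equals 'a'
first_pos = next((i for i, v in enumerate(values) if v != 'a'), None)

print(first_pos)  # 9995
```

This is still a linear scan in Python, so on sorted data the binary search of searchsorted should remain the better choice. The sketch only removes the per-element pandas indexing.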
Anna K. answered Oct 14 '22 21:10