
Selecting all rows before a certain entry in a pandas dataframe

How do I select the rows that come before the first appearance of a certain value in a column?

I have a dataset of user activity with timestamps, recorded as follows:

import pandas as pd

df = pd.DataFrame([{'user_id':1, 'date':'2017-09-01', 'activity':'Open'},
                   {'user_id':1, 'date':'2017-09-02', 'activity':'Open'},
                   {'user_id':1, 'date':'2017-09-03', 'activity':'Open'},
                   {'user_id':1, 'date':'2017-09-04', 'activity':'Click'},
                   {'user_id':1, 'date':'2017-09-05', 'activity':'Purchase'},
                   {'user_id':1, 'date':'2017-09-06', 'activity':'Open'},
                   {'user_id':1, 'date':'2017-09-07', 'activity':'Open'},
                   {'user_id':2, 'date':'2017-09-04', 'activity':'Open'},
                   {'user_id':2, 'date':'2017-09-06', 'activity':'Purchase'}])

Is there a way to select, for each user, all the rows that occur before that user's first purchase? In this example, the desired output would be:

df = pd.DataFrame([{'user_id':1, 'date':'2017-09-01', 'activity':'Open'},
                   {'user_id':1, 'date':'2017-09-02', 'activity':'Open'},
                   {'user_id':1, 'date':'2017-09-03', 'activity':'Open'},
                   {'user_id':1, 'date':'2017-09-04', 'activity':'Click'},
                   {'user_id':2, 'date':'2017-09-04', 'activity':'Open'}])
Matthew Mok asked Mar 09 '23 00:03



1 Answer

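You can avoid an explicit apply by building a boolean mask: flag the 'Purchase' rows, take the cumulative sum of that flag within each user_id group, and keep only the rows where the running count is still 0: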

In [2862]: df[df['activity'].eq('Purchase').groupby(df['user_id']).cumsum().eq(0)]
Out[2862]:
  activity        date  user_id
0     Open  2017-09-01        1
1     Open  2017-09-02        1
2     Open  2017-09-03        1
3    Click  2017-09-04        1
7     Open  2017-09-04        2
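
To make the one-liner easier to follow, here is a sketch that splits it into named steps, using the sample data from the question. The intermediate names (is_purchase, purchases_so_far, before_first) are only illustrative and do not appear in the original answer. It assumes the rows are already in chronological order within each user, as they are in the sample.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame([
    {'user_id': 1, 'date': '2017-09-01', 'activity': 'Open'},
    {'user_id': 1, 'date': '2017-09-02', 'activity': 'Open'},
    {'user_id': 1, 'date': '2017-09-03', 'activity': 'Open'},
    {'user_id': 1, 'date': '2017-09-04', 'activity': 'Click'},
    {'user_id': 1, 'date': '2017-09-05', 'activity': 'Purchase'},
    {'user_id': 1, 'date': '2017-09-06', 'activity': 'Open'},
    {'user_id': 1, 'date': '2017-09-07', 'activity': 'Open'},
    {'user_id': 2, 'date': '2017-09-04', 'activity': 'Open'},
    {'user_id': 2, 'date': '2017-09-06', 'activity': 'Purchase'},
])

# Step 1: True on every 'Purchase' row, False elsewhere
is_purchase = df['activity'].eq('Purchase')

# Step 2: running count of purchases seen so far, restarted for each user
purchases_so_far = is_purchase.groupby(df['user_id']).cumsum()

# Step 3: keep rows where no purchase has been seen yet;
# the first purchase row itself has a count of 1, so it is excluded
before_first = df[purchases_so_far.eq(0)]

print(before_first)
```

If the data might not be sorted, calling df.sort_values(['user_id', 'date']) before step 1 should restore the ordering that the cumulative sum depends on. Users who never purchase keep all of their rows, because their running count never leaves 0.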
Zero answered Mar 10 '23 12:03
