 

pandas: finding first incidences of events in df based on column values and marking as new column values

Tags:

python

pandas

I have a dataframe which looks like this:

customer_id event_date data 
1           2012-10-18    0      
1           2012-10-12    0      
1           2015-10-12    0      
2           2012-09-02    0      
2           2013-09-12    1      
3           2010-10-21    0      
3           2013-11-08    0      
3           2013-12-07    1     
3           2015-09-12    1    

I wish to add additional columns, such as 'flag_1' and 'flag_2' below, which allow me (and others, when I pass on the amended data) to filter easily.

Flag_1 is an indication of the first appearance of that customer in the data set. I have implemented this successfully by sorting with dta = dta.sort_values(['customer_id', 'event_date']) and then using (~dta.duplicated(['customer_id'])).astype(int).
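
In full, that step looks like this (duplicated() marks repeated rows, so it is negated to mark first appearances instead):

# Sort so that the earliest event per customer comes first,
# then flag the first row of each customer with 1.
dta = dta.sort_values(['customer_id', 'event_date'])
dta['flag_1'] = (~dta.duplicated(['customer_id'])).astype(int)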

Flag_2 would be an indication of the first incidence, for each customer, where the column 'data' = 1.

An example of what the dataframe would look like with the additional columns implemented is shown below:

customer_id event_date data flag_1 flag_2
1           2012-10-18    0      1      0
1           2012-10-12    0      0      0
1           2015-10-12    0      0      0
2           2012-09-02    0      1      0
2           2013-09-12    1      0      1
3           2010-10-21    0      1      0
3           2013-11-08    0      0      0
3           2013-12-07    1      0      1
3           2015-09-12    1      0      0

I am new to pandas and unsure how to implement the 'flag_2' column without iterating over the entire dataframe. I presume there is a quicker way to do this with built-in functions, but I haven't found any relevant posts.

Thanks

asked Oct 19 '22 by user

1 Answer

First initialize empty flags. Use groupby to get the groups based on the customer_id. For the first flag, use loc to set the value of flag1 for the first value in each group. Use the same strategy for flag2, but first filter for cases where data has been set to one.

# Initialize empty flags
df['flag1'] = 0
df['flag2'] = 0

# Set flag1
groups = df.groupby('customer_id').groups
df.loc[[values[0] for values in groups.values()], 'flag1'] = 1

# Set flag2
groups2 = df.loc[df.data == 1, :].groupby('customer_id').groups
df.loc[[values[0] for values in groups2.values()], 'flag2'] = 1

>>> df
   customer_id  event_date  data  flag1  flag2
0            1  2012-10-18     0      1      0
1            1  2012-10-12     0      0      0
2            1  2015-10-12     0      0      0
3            2  2012-09-02     0      1      0
4            2  2013-09-12     1      0      1
5            3  2010-10-21     0      1      0
6            3  2013-11-08     0      0      0
7            3  2013-12-07     1      0      1
8            3  2015-09-12     1      0      0
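
As a side note, roughly the same flags can be produced without building the groups dictionaries, by marking first rows directly with duplicated and drop_duplicates. This is only a sketch and assumes the frame is already sorted by customer_id and event_date, so that "first" means chronologically first:

# flag1: first row of each customer overall
# (duplicated() marks repeats, so negate it to mark first appearances)
df['flag1'] = (~df.duplicated('customer_id')).astype(int)

# flag2: first row per customer where data == 1
df['flag2'] = 0
first_hits = df[df.data == 1].drop_duplicates('customer_id').index
df.loc[first_hits, 'flag2'] = 1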
answered Nov 15 '22 by Alexander