 

pandas dataframe column based on previous rows

I have the dataframe below:

         id  action   
         ================
         10   CREATED   
         10   111
         10   222
         10   333
         10   DONE      
         10   222
         10   UPDATED   
         777  CREATED    
         10   333
         10   DONE      

I would like to create a new column "check" based on data in the previous rows of the dataframe:

  1. Find each cell in the action column that equals "DONE".
  2. Search backwards through the previous rows for the nearest CREATED or UPDATED with the same id. If it is CREATED, put C; if it is UPDATED, put U.

Output:

         id  action   check
         ================
         10   CREATED   
         10   111
         10   222
         10   333
         10   DONE      C
         10   222
         10   UPDATED   
         777  CREATED    
         10   333
         10   DONE      U

I tried to use multiple if conditions, but that did not work for me. Could you please help?

johnt asked Jun 12 '20 16:06





1 Answer

Consider a more sophisticated sample dataframe for illustration:

# print(df)
id  action   
10   CREATED   
10   111
10   222
10   333
10   DONE      
10   222
10   UPDATED   
777  CREATED    
10   333
10   DONE
777  DONE
10   CREATED
10   DONE
11   UPDATED
11   DONE     
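To reproduce the sample, the dataframe can be built along these lines. This is a minimal sketch, assuming that every action is stored as a string (so 111 becomes '111') and that id holds integers; the printout above does not confirm either dtype.

```python
import pandas as pd

# Assumption: all actions are strings and ids are integers.
df = pd.DataFrame({
    'id': [10, 10, 10, 10, 10, 10, 10, 777, 10, 10, 777, 10, 10, 11, 11],
    'action': ['CREATED', '111', '222', '333', 'DONE', '222', 'UPDATED',
               'CREATED', '333', 'DONE', 'DONE', 'CREATED', 'DONE',
               'UPDATED', 'DONE'],
})
```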

Use:

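# Return the last CREATED/UPDATED value in a segment: cumsum() peaks at the
# last marker and idxmax() gives the label where that peak is first reached.
transformer = lambda s: s[(s.eq('CREATED') | s.eq('UPDATED')).cumsum().idxmax()]

# Within one id, split the rows into segments that each end at a DONE row.
grouper = (
    lambda g: g.groupby(
        g['action'].eq('DONE').cumsum().shift().fillna(0))['action']
    .transform(transformer)
)

# Keep only the first letter (C or U), then blank out every non-DONE row.
df['check'] = df.groupby('id').apply(grouper).droplevel(0).str[0]
df.loc[df['action'].ne('DONE'), 'check'] = ''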

Explanation:

First we group the dataframe on id and apply the grouper function to each group. Inside each group we group again on a running count of the DONE values in the action column, which splits the group into segments that each end at a DONE row.

The transformer lambda then replaces every value in a segment with the last CREATED or UPDATED in that segment, that is, the one nearest to the DONE that closes it. Finally, we keep only the first letter (C or U) and blank out every row whose action is not DONE.
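The intermediate steps for a single id can be inspected in isolation. The sketch below uses a hypothetical seven-row frame `g` (not the sample above) and the illustrative names `segment`, `last_marker` and `check`, none of which appear in the answer's code.

```python
import pandas as pd

# Hypothetical rows for one id, in their original order.
g = pd.DataFrame({
    'id': [10] * 7,
    'action': ['CREATED', '111', 'DONE', '222', 'UPDATED', '333', 'DONE'],
})

# Segment key: how many DONE rows occur strictly before the current row.
# The shift keeps each DONE row inside the segment it closes.
segment = g['action'].eq('DONE').cumsum().shift().fillna(0)

# Broadcast the last CREATED/UPDATED of each segment over the whole segment.
last_marker = g['action'].groupby(segment).transform(
    lambda s: s.loc[s.isin(['CREATED', 'UPDATED']).cumsum().idxmax()]
)

# Keep the first letter on DONE rows only.
check = last_marker.str[0].where(g['action'].eq('DONE'), '')
print(pd.concat([g, segment.rename('segment'), check.rename('check')], axis=1))
```

Here rows 0 to 2 form segment 0 and rows 3 to 6 form segment 1, so the two DONE rows receive C and U respectively. One caveat: if a segment contains no CREATED or UPDATED at all, the cumulative sum is zero everywhere and idxmax falls back to the first row of the segment, so the check would be taken from an unrelated value.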

Result:

# print(df)
     id   action check
0    10  CREATED      
1    10      111      
2    10      222      
3    10      333      
4    10     DONE     C
5    10      222      
6    10  UPDATED      
7   777  CREATED      
8    10      333      
9    10     DONE     U
10  777     DONE     C
11   10  CREATED      
12   10     DONE     C
13   11  UPDATED      
14   11     DONE     U

Shubham Sharma answered Sep 30 '22 20:09
