
Excel-like formulas with pandas

I have a pandas DataFrame in this format:

User_id|2014-01|2014-02|2014-03|2014-04|2014-05|...|2014-12
1      |   7   | NaN   | NaN   | NaN   | NaN   |...|  NaN
2      | NaN   |   5   | NaN   | NaN   |   9   |...|  NaN
3      |   2   |   4   | NaN   | NaN   | NaN   |...|  NaN

In words: the columns are months, the index is the user_id, and each cell contains either an integer or NaN.

The numbers represent actions that were taken. An action is considered successful if no other action was needed in the 3 months after it.

My goal is to find the list of successful actions.

In Excel, I'd write a formula like this:

Sheet2!E5=AND(Sheet1!E5<>"NaN",Sheet1!D5="NaN",Sheet1!C5="NaN",Sheet1!B5="NaN")

I would then drag it across the remaining cells to get an indicator of whether each action was successful.

How can this be done efficiently with pandas?

Sample output:

For the example given above, the desired output should be:

User_id|2014-01|2014-02|2014-03|2014-04|2014-05|
1      |   T   |   F   |   F   |   F   |   F   |
2      |   F   |   F   |   F   |   F   |   ?   |
3      |   F   |   T   |   F   |   F   |   F   |
Uri Goren asked Nov 01 '22 03:11



1 Answer

I'm not sure how you want to handle the right-most columns (your example just has a '?'), but you can adjust the following code fairly easily, or pad the data with placeholder numbers or NaNs:

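df2 = df.copy()
# Column 0 is User_id, so start at the first month column.
for i in range(1,len(df.columns)):
    # Flag cells that hold an action and whose next three months are all NaN.
    df2.iloc[:,i] = ((df.iloc[:,i].notnull()) &
                     (df.iloc[:,i+1:i+4].apply(lambda x: all(x.isnull()),axis=1)))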

Starting data df:

   User_id  2014-01  2014-02  2014-03  2014-04  2014-05
0        1        7      NaN      NaN      NaN      NaN
1        2      NaN        5      NaN      NaN        9
2        3        2        4      NaN      NaN      NaN
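
To try the code on this sample, the frame printed above can be rebuilt along these lines. It is only a sketch: it assumes User_id is an ordinary column (as in the printout) rather than the index, and it covers just the five months shown.

```python
import numpy as np
import pandas as pd

# Sample data copied from the table above; NaN marks a month with no action.
df = pd.DataFrame({
    "User_id": [1, 2, 3],
    "2014-01": [7, np.nan, 2],
    "2014-02": [np.nan, 5, 4],
    "2014-03": [np.nan, np.nan, np.nan],
    "2014-04": [np.nan, np.nan, np.nan],
    "2014-05": [np.nan, 9, np.nan],
})
print(df)
```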

Results df2:

   User_id 2014-01 2014-02 2014-03 2014-04 2014-05
0        1    True   False   False   False   False
1        2   False   False   False   False   False
2        3   False    True   False   False   False
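
If you would rather keep the '?' from the question for actions whose three-month window runs past the last column, one possible variation on the loop is sketched below. It is not the code that produced the table above. The names `months`, `window` and `labels` are made up for this example, User_id is moved into the index, and the T/F/? strings simply mirror the question's notation.

```python
import numpy as np
import pandas as pd

# Sample frame from the question (five months only).
df = pd.DataFrame({
    "User_id": [1, 2, 3],
    "2014-01": [7, np.nan, 2],
    "2014-02": [np.nan, 5, 4],
    "2014-03": [np.nan, np.nan, np.nan],
    "2014-04": [np.nan, np.nan, np.nan],
    "2014-05": [np.nan, 9, np.nan],
})

months = df.set_index("User_id")   # assumption: only month columns remain
window = 3                         # months that must stay empty after an action

# Start with "F" everywhere and overwrite the cells that qualify.
labels = pd.DataFrame("F", index=months.index, columns=months.columns)
for i, col in enumerate(months.columns):
    following = months.iloc[:, i + 1:i + 1 + window]
    acted = months[col].notna().to_numpy()
    # NumPy's all() over an empty slice is True, so the right edge needs no special case here.
    quiet = following.isna().to_numpy().all(axis=1)
    # A window with fewer than `window` columns cannot be decided yet.
    complete = following.shape[1] == window
    labels.loc[acted & quiet, col] = "T" if complete else "?"

print(labels)
```

On the sample this marks user 2's 2014-05 action as '?', because no later months are available to confirm it.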

For the aforementioned padding, you could add three placeholder columns and then tweak the remaining code slightly:

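import numpy as np

# Three all-NaN columns give every month a full three-column look-ahead.
df[['pad1','pad2','pad3']] = np.nan

df2 = df.copy().iloc[:,:-3]    # result frame without the padding columns
for i in range(1,len(df2.columns)):
    df2.iloc[:,i] = ((df.iloc[:,i].notnull()) &
                     (df.iloc[:,i+1:i+4].apply(lambda x: all(x.isnull()),axis=1)))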

And now you have one 'True' in the last column:

   User_id 2014-01 2014-02 2014-03 2014-04 2014-05
0        1    True   False   False   False   False
1        2   False   False   False   False    True
2        3   False    True   False   False   False
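
The row-wise apply can become slow on large frames. Below is a sketch of the same padded rule using NumPy slicing instead. It treats months beyond the last column as "no action", which is what the NaN padding does. The names `months`, `has_action`, `followed` and `success` are illustrative, and User_id is moved into the index, which differs from the layout used above.

```python
import numpy as np
import pandas as pd

# Sample frame from the question (five months only).
df = pd.DataFrame({
    "User_id": [1, 2, 3],
    "2014-01": [7, np.nan, 2],
    "2014-02": [np.nan, 5, 4],
    "2014-03": [np.nan, np.nan, np.nan],
    "2014-04": [np.nan, np.nan, np.nan],
    "2014-05": [np.nan, 9, np.nan],
})

months = df.set_index("User_id")
has_action = months.notna().to_numpy()
n_cols = has_action.shape[1]

# followed[r, c] becomes True if any of the next three months holds an action.
followed = np.zeros_like(has_action)
for k in range(1, 4):
    if k < n_cols:
        followed[:, :-k] |= has_action[:, k:]

# Successful: an action exists and nothing follows within three months.
success = pd.DataFrame(has_action & ~followed,
                       index=months.index, columns=months.columns)
print(success)
```

The loop runs three times regardless of the number of rows, so the per-row Python overhead of apply is avoided.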
JohnE answered Nov 15 '22 04:11
