I have a dataframe like as shown below
df = pd.DataFrame({'person_id': [101,101,101,101,202,202,202],
'start_date':['5/7/2013 09:27:00 AM','09/08/2013 11:21:00 AM','06/06/2014 08:00:00 AM','06/06/2014 05:00:00 AM','12/11/2011 10:00:00 AM','13/10/2012 12:00:00 AM','13/12/2012 11:45:00 AM'],
'end_date':['5/12/2013 09:27:00 AM',np.nan,'06/11/2014 08:00:00 AM',np.nan,'12/16/2011 10:00:00','10/18/2012 00:00:00',np.nan],
'type':['O','I','O','O','I','O','I']})
df.start_date = pd.to_datetime(df.start_date)
df['end_date'] = pd.to_datetime(df.end_date)
I would like to fillna() under the end_date column based on two approaches below
a) If NA is found in any row except last row of that person, fillna by copying the value from next row
b) If NA is found in the last row of that person fillna by adding 10 days to his start_date (because there is no next row for that person to copy from. So, we give random value of 10 days)
The rules a and b only for persons with type=I.
For persons with type=O, just fillna by copying the value from start_date.
This is what I tried. You can see am writing the same code line twice.
df['end_date'] = np.where(df['type'].str.contains('I'),pd.DatetimeIndex(df['end_date'].bfill()),pd.DatetimeIndex(df.start_date.dt.date))
df['end_date'] = np.where(df['type'].str.contains('I'),pd.DatetimeIndex(df['start_date'] + pd.DateOffset(10)),pd.DatetimeIndex(df.start_date.dt.date))
Any elegant and efficient way to write this as I have to apply this on a big data with 15 million rows?
I expect my output to be like as shown below

s1 = df.groupby('person_id')['start_date'].shift(-1)
s1 = s1.fillna(df['start_date'] + pd.DateOffset(days=10))
s1 = df['end_date'].fillna(s1)
s2 = df['end_date'].fillna(df['start_date'])
df['end_date'] = np.where(df['type'].eq('I'), s1, s2)
Group the dataframe on person_id and shift the column start_date one units upwards.
>>> df.groupby('person_id')['start_date'].shift(-1)
0 2013-09-08 11:21:00
1 2014-06-06 08:00:00
2 2014-06-06 05:00:00
3 NaT
4 2012-10-13 00:00:00
5 2012-12-13 11:45:00
6 NaT
Name: start_date, dtype: datetime64[ns]
Fill the NaN values in the shifted column with the values from start_date column after adding an offset of 10 days
>>> s1.fillna(df['start_date'] + pd.DateOffset(days=10))
0 2013-09-08 11:21:00
1 2014-06-06 08:00:00
2 2014-06-06 05:00:00
3 2014-06-16 05:00:00
4 2012-10-13 00:00:00
5 2012-12-13 11:45:00
6 2012-12-23 11:45:00
Name: start_date, dtype: datetime64[ns]
Now fill the NaN values in end_date column with the above series s1
>>> df['end_date'].fillna(s1)
0 2013-05-12 09:27:00
1 2014-06-06 08:00:00
2 2014-06-11 08:00:00
3 2014-06-16 05:00:00
4 2011-12-16 10:00:00
5 2012-10-18 00:00:00
6 2012-12-23 11:45:00
Name: end_date, dtype: datetime64[ns]
Similarly fill the NaN values in end_date column with the values from start_date column to create a series s2
>>> df['end_date'].fillna(df['start_date'])
0 2013-05-12 09:27:00
1 2013-09-08 11:21:00
2 2014-06-11 08:00:00
3 2014-06-06 05:00:00
4 2011-12-16 10:00:00
5 2012-10-18 00:00:00
6 2012-12-13 11:45:00
Name: end_date, dtype: datetime64[ns]
Then use np.where to select the values from s1 / s2 based on the condition where the type is I or O
>>> df
person_id start_date end_date type
0 101 2013-05-07 09:27:00 2013-05-12 09:27:00 O
1 101 2013-09-08 11:21:00 2014-06-06 08:00:00 I
2 101 2014-06-06 08:00:00 2014-06-11 08:00:00 O
3 101 2014-06-06 05:00:00 2014-06-06 05:00:00 O
4 202 2011-12-11 10:00:00 2011-12-16 10:00:00 I
5 202 2012-10-13 00:00:00 2012-10-18 00:00:00 O
6 202 2012-12-13 11:45:00 2012-12-23 11:45:00 I
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With