Fillna using multiple approaches at a time - pandas

Question

I have a dataframe like the one shown below:

import numpy as np
import pandas as pd

# Note: the date strings mix month-first and day-first formats, so newer
# pandas versions may need an explicit format or dayfirst setting to parse them.
df = pd.DataFrame({'person_id': [101,101,101,101,202,202,202],
                   'start_date':['5/7/2013 09:27:00 AM','09/08/2013 11:21:00 AM','06/06/2014 08:00:00 AM','06/06/2014 05:00:00 AM','12/11/2011 10:00:00 AM','13/10/2012 12:00:00 AM','13/12/2012 11:45:00 AM'],
                   'end_date':['5/12/2013 09:27:00 AM',np.nan,'06/11/2014 08:00:00 AM',np.nan,'12/16/2011 10:00:00','10/18/2012 00:00:00',np.nan],
                   'type':['O','I','O','O','I','O','I']})
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])

I would like to fill the missing values in the end_date column using the two rules below:

a) If the NA is in any row except the last row for that person, fill it by copying the value from the next row.

b) If the NA is in the last row for that person, fill it by adding 10 days to that row's start_date (there is no next row for that person to copy from, so an arbitrary offset of 10 days is used).

Rules a and b apply only to rows with type=I.

For rows with type=O, just fill the NA by copying the value from start_date.

This is what I tried. As you can see, I am writing nearly the same line of code twice.

  df['end_date'] = np.where(df['type'].str.contains('I'),pd.DatetimeIndex(df['end_date'].bfill()),pd.DatetimeIndex(df.start_date.dt.date))
  df['end_date'] = np.where(df['type'].str.contains('I'),pd.DatetimeIndex(df['start_date'] + pd.DateOffset(10)),pd.DatetimeIndex(df.start_date.dt.date))

Is there a more elegant and efficient way to write this? I have to apply it to a large dataset with 15 million rows.

My expected output was attached to the original post as an image.

[image: expected output dataframe]

Shubham Sharma · Accepted Answer

Solution

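# Type I: use the next row's start_date for the same person, else start_date + 10 days
s1 = df.groupby('person_id')['start_date'].shift(-1)
s1 = s1.fillna(df['start_date'] + pd.DateOffset(days=10))
s1 = df['end_date'].fillna(s1)

# Type O: fall back to the row's own start_date
s2 = df['end_date'].fillna(df['start_date'])
# Pick s1 for type I rows and s2 for all other rows
df['end_date'] = np.where(df['type'].eq('I'), s1, s2)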

Explanations

Group the dataframe on person_id and shift the start_date column one row upwards, so that each row sees the start_date of the next row for the same person.

>>> df.groupby('person_id')['start_date'].shift(-1)

0   2013-09-08 11:21:00
1   2014-06-06 08:00:00
2   2014-06-06 05:00:00
3                   NaT
4   2012-10-13 00:00:00
5   2012-12-13 11:45:00
6                   NaT
Name: start_date, dtype: datetime64[ns]
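
To see why the last row of each person comes out as NaT, here is a minimal sketch on a toy frame (the frame `toy` and its values are made up for illustration). Within a group, `shift(-1)` pulls each value from the following row of the same group, and the final row of every group has nothing to pull from.

```python
import pandas as pd

# Hypothetical toy data, not the frame from the question.
toy = pd.DataFrame({
    "person_id": [1, 1, 1, 2, 2],
    "start_date": pd.to_datetime(
        ["2020-01-01", "2020-01-05", "2020-01-09", "2020-02-01", "2020-02-03"]
    ),
})

# Within each person_id, take the start_date of the following row.
next_start = toy.groupby("person_id")["start_date"].shift(-1)

# The last row of each group has no following row, so it becomes NaT.
print(next_start)
```

The shift never crosses a group boundary, so person 1's last row does not pick up person 2's first start_date.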

Fill the NaT values in the shifted column with the values from the start_date column plus an offset of 10 days:

>>> s1.fillna(df['start_date'] + pd.DateOffset(days=10))

0   2013-09-08 11:21:00
1   2014-06-06 08:00:00
2   2014-06-06 05:00:00
3   2014-06-16 05:00:00
4   2012-10-13 00:00:00
5   2012-12-13 11:45:00
6   2012-12-23 11:45:00
Name: start_date, dtype: datetime64[ns]
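
As an aside on the 10-day offset: for a fixed number of days on timezone-naive timestamps, `pd.Timedelta(days=10)` should give the same values as `pd.DateOffset(days=10)`. The sketch below checks this on made-up timestamps. Whether one is faster than the other on 15 million rows is not something the answer states, so it would need to be measured.

```python
import pandas as pd

# Hypothetical timestamps for illustration.
starts = pd.Series(pd.to_datetime(["2014-06-06 05:00:00", "2012-12-13 11:45:00"]))

with_offset = starts + pd.DateOffset(days=10)
with_timedelta = starts + pd.Timedelta(days=10)

# Both approaches move each timestamp forward by exactly 10 days here.
same = bool((with_offset == with_timedelta).all())
print(same)
```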

Now fill the missing values in the end_date column with the above series s1:

>>> df['end_date'].fillna(s1)

0   2013-05-12 09:27:00
1   2014-06-06 08:00:00
2   2014-06-11 08:00:00
3   2014-06-16 05:00:00
4   2011-12-16 10:00:00
5   2012-10-18 00:00:00
6   2012-12-23 11:45:00
Name: end_date, dtype: datetime64[ns]
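
Passing a Series to `fillna` aligns it on the index and only touches the missing entries, which is why rows 0, 2, 4 and 5 keep their original end_date above. A small sketch with made-up values (the names `end` and `fallback` are illustrative only):

```python
import pandas as pd

# Hypothetical series: two real values and two missing ones.
end = pd.Series(pd.to_datetime(["2021-03-01", None, "2021-03-05", None]))
fallback = pd.Series(
    pd.to_datetime(["2000-01-01", "2000-01-02", "2000-01-03", "2000-01-04"])
)

# Only positions 1 and 3 are taken from fallback; existing values are untouched.
filled = end.fillna(fallback)
print(filled)
```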

Similarly, fill the missing values in the end_date column with the values from the start_date column to create the series s2:

>>> df['end_date'].fillna(df['start_date'])

0   2013-05-12 09:27:00
1   2013-09-08 11:21:00
2   2014-06-11 08:00:00
3   2014-06-06 05:00:00
4   2011-12-16 10:00:00
5   2012-10-18 00:00:00
6   2012-12-13 11:45:00
Name: end_date, dtype: datetime64[ns]

Then use np.where to select values from s1 where the type is I and from s2 where the type is O:

>>> df

   person_id          start_date            end_date type
0        101 2013-05-07 09:27:00 2013-05-12 09:27:00    O
1        101 2013-09-08 11:21:00 2014-06-06 08:00:00    I
2        101 2014-06-06 08:00:00 2014-06-11 08:00:00    O
3        101 2014-06-06 05:00:00 2014-06-06 05:00:00    O
4        202 2011-12-11 10:00:00 2011-12-16 10:00:00    I
5        202 2012-10-13 00:00:00 2012-10-18 00:00:00    O
6        202 2012-12-13 11:45:00 2012-12-23 11:45:00    I
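
The steps above can also be wrapped in a function and checked against this table. The sketch below is a variant, not the answer's exact code: the function name `fill_end_date` is hypothetical, `Series.where` is swapped in for `np.where` so the result stays a pandas Series, and the input dates are written in ISO format (taken from the parsed values shown above) to sidestep the mixed date formats in the question.

```python
import pandas as pd

def fill_end_date(df):
    """Return end_date with missing values filled per the rules in the question."""
    # Type I: next start_date for the same person, else start_date + 10 days.
    next_start = df.groupby("person_id")["start_date"].shift(-1)
    fallback_i = next_start.fillna(df["start_date"] + pd.DateOffset(days=10))
    s1 = df["end_date"].fillna(fallback_i)
    # Type O: the row's own start_date.
    s2 = df["end_date"].fillna(df["start_date"])
    # Keep s1 where type is I, otherwise take s2.
    return s1.where(df["type"].eq("I"), s2)

df = pd.DataFrame({
    "person_id": [101, 101, 101, 101, 202, 202, 202],
    "start_date": pd.to_datetime([
        "2013-05-07 09:27:00", "2013-09-08 11:21:00", "2014-06-06 08:00:00",
        "2014-06-06 05:00:00", "2011-12-11 10:00:00", "2012-10-13 00:00:00",
        "2012-12-13 11:45:00",
    ]),
    "end_date": pd.to_datetime([
        "2013-05-12 09:27:00", None, "2014-06-11 08:00:00", None,
        "2011-12-16 10:00:00", "2012-10-18 00:00:00", None,
    ]),
    "type": ["O", "I", "O", "O", "I", "O", "I"],
})

df["end_date"] = fill_end_date(df)
print(df)
```

Returning a Series rather than assigning inside the function leaves the caller free to write the result to a new column and compare it with the original.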

Tags: python, datetime, pandas, dataframe, pandas-groupby

The Great
