I have a pandas dataframe with 50 million records and what I am trying to do is backfill based on a condition. As we can see that the timestamps for name 800A and Barber align so I assume that the data belongs to same name and it is just an error while recording the data. The same goes with name Mia.
This is just the sample data.
my dataframe looks like this.
datetime name dischargeDate HR Sp x_inc vs_inc rec_num
01-05 18:04:50 Zawisza 14-01-05 18:05:00 119 98 FALSE TRUE 6458445
01-05 18:04:55 Zawisza 14-01-05 18:05:00 120 97 FALSE TRUE 6458445
01-05 18:05:00 Zawisza 14-01-05 18:05:00 FALSE FALSE
01-29 17:58:45 800A 14-01-29 17:59:10 FALSE FALSE
01-29 17:58:50 800A 14-01-29 17:59:10 139 FALSE TRUE
01-29 17:58:55 800A 14-01-29 17:59:10 138 FALSE TRUE
01-29 17:59:00 800A 14-01-29 17:59:10 138 96 FALSE TRUE
01-29 17:59:15 Barber 14-01-29 18:17:15 138 96 FALSE TRUE 7192783
01-29 17:59:20 Barber 14-01-29 18:17:15 138 96 FALSE TRUE 7192783
01-29 17:59:25 Barber 14-01-29 18:17:15 138 95 FALSE TRUE 7192783
03-04 21:19:45 800A 15-03-05 01:00:15 FALSE FALSE
03-05 00:53:10 800A 15-03-05 01:00:15 FALSE FALSE
03-05 00:55:50 800A 15-03-05 01:00:15 94 FALSE TRUE
03-05 00:55:55 800A 15-03-05 01:00:15 81 93 FALSE TRUE
03-05 00:56:00 800A 15-03-05 01:00:15 89 93 FALSE TRUE
03-05 01:00:20 Mia 15-03-05 04:13:15 70 93 FALSE TRUE 6728923
03-05 01:00:25 Mia 15-03-05 04:13:15 70 93 FALSE TRUE 6728923
03-05 01:00:30 Mia 15-03-05 04:13:15 70 94 FALSE TRUE 6728923
Now I am trying to backfill the record numbers(rec_num) column until it maps the bool condition False False in both the x_inc and vs_inc columns.
Actual output:
datetime name dischargeDate HR Sp x_inc vs_inc rec_num
01-05 18:04:50 Zawisza 14-01-05 18:05:00 119 98 FALSE TRUE 6458445
01-05 18:04:55 Zawisza 14-01-05 18:05:00 120 97 FALSE TRUE 6458445
01-05 18:05:00 Zawisza 14-01-05 18:05:00 FALSE FALSE 7192783
01-29 17:58:45 800A 14-01-29 17:59:10 FALSE FALSE 7192783
01-29 17:58:50 800A 14-01-29 17:59:10 139 FALSE TRUE 7192783
01-29 17:58:55 800A 14-01-29 17:59:10 138 FALSE TRUE 7192783
01-29 17:59:00 800A 14-01-29 17:59:10 138 96 FALSE TRUE 7192783
01-29 17:59:15 Barber 14-01-29 18:17:15 138 96 FALSE TRUE 7192783
01-29 17:59:20 Barber 14-01-29 18:17:15 138 96 FALSE TRUE 7192783
01-29 17:59:25 Barber 14-01-29 18:17:15 138 95 FALSE TRUE 7192783
03-04 21:19:45 800A 15-03-05 01:00:15 FALSE FALSE 6728923
03-05 00:53:10 800A 15-03-05 01:00:15 FALSE FALSE 6728923
03-05 00:55:50 800A 15-03-05 01:00:15 94 FALSE TRUE 6728923
03-05 00:55:55 800A 15-03-05 01:00:15 81 93 FALSE TRUE 6728923
03-05 00:56:00 800A 15-03-05 01:00:15 89 93 FALSE TRUE 6728923
03-05 01:00:20 Mia 15-03-05 04:13:15 70 93 FALSE TRUE 6728923
03-05 01:00:25 Mia 15-03-05 04:13:15 70 93 FALSE TRUE 6728923
03-05 01:00:30 Mia 15-03-05 04:13:15 70 94 FALSE TRUE 6728923
Expected output:
datetime name dischargeDate HR Sp x_inc vs_inc rec_num
01-05 18:04:50 Zawisza 14-01-05 18:05:00 119 98 FALSE TRUE 6458445
01-05 18:04:55 Zawisza 14-01-05 18:05:00 120 97 FALSE TRUE 6458445
01-05 18:05:00 Zawisza 14-01-05 18:05:00 FALSE FALSE
01-29 17:58:45 800A 14-01-29 17:59:10 FALSE FALSE
01-29 17:58:50 800A 14-01-29 17:59:10 139 FALSE TRUE 7192783
01-29 17:58:55 800A 14-01-29 17:59:10 138 FALSE TRUE 7192783
01-29 17:59:00 800A 14-01-29 17:59:10 138 96 FALSE TRUE 7192783
01-29 17:59:15 Barber 14-01-29 18:17:15 138 96 FALSE TRUE 7192783
01-29 17:59:20 Barber 14-01-29 18:17:15 138 96 FALSE TRUE 7192783
01-29 17:59:25 Barber 14-01-29 18:17:15 138 95 FALSE TRUE 7192783
03-04 21:19:45 800A 15-03-05 01:00:15 FALSE FALSE
03-05 00:53:10 800A 15-03-05 01:00:15 FALSE FALSE
03-05 00:55:50 800A 15-03-05 01:00:15 94 FALSE TRUE 6728923
03-05 00:55:55 800A 15-03-05 01:00:15 81 93 FALSE TRUE 6728923
03-05 00:56:00 800A 15-03-05 01:00:15 89 93 FALSE TRUE 6728923
03-05 01:00:20 Mia 15-03-05 04:13:15 70 93 FALSE TRUE 6728923
03-05 01:00:25 Mia 15-03-05 04:13:15 70 93 FALSE TRUE 6728923
03-05 01:00:30 Mia 15-03-05 04:13:15 70 94 FALSE TRUE 6728923
I am using df['rec_num'].fillna(method='bfill') but it fills completely which is not my ideal solution. I would appreciate if I can get any suggestions to this problem(or if there is any better approach). Thanks in advance.
Using a boolean mask and np.where() you can use this:
cond=(df.x_inc == False) & (df.vs_inc == False) #creates a boolean mask where both columns are false
df['new_rec']=np.where(~cond,df.rec_num.bfill(),df.rec_num) #does a backfill on where condition is not met
print(df)
Note : you can reassign the values to the old column named rec_num instead of creating a new column. I added that so you could compare. Also this should be the fastest method since vectorized
datetime name dischargeDate HR Sp x_inc vs_inc rec_num new_rec
0 2019-05-01 18:04:50 Zawisza 2005-01-14 18:05:00 119.0 98.0 False True 6458445.0 6458445.0
1 2019-05-01 18:04:55 Zawisza 2005-01-14 18:05:00 120.0 97.0 False True 6458445.0 6458445.0
2 2019-05-01 18:05:00 Zawisza 2005-01-14 18:05:00 NaN NaN False False NaN NaN
3 2029-01-01 17:58:45 800A 2029-01-14 17:59:10 NaN NaN False False NaN NaN
4 2029-01-01 17:58:50 800A 2029-01-14 17:59:10 139.0 NaN False True NaN 7192783.0
5 2029-01-01 17:58:55 800A 2029-01-14 17:59:10 138.0 NaN False True NaN 7192783.0
...........................................................
...........................................................
....................................................
.....................................
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With