I'm really new to Python and data science.
I have a large dataset (100K+ rows) with two columns, A and B. A is a datetime column and B is a string column.
Column B has some NaN values. I want to fill each NaN with the latest known B value, on the condition that the empty row and the already filled row fall on the same day, month and year (column A).
Let me explain.
Let's say this is my input:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ["2019-03-13 08:12:23", "2019-03-13 07:10:18", "2019-03-20 08:12:23", "2019-03-13 08:12:23", "2019-03-15 10:35:53", "2019-03-20 11:12:23"], 'B': ["B1", "B0", "B13", np.nan, "B10", "B12"]})
                     A    B
0  2019-03-13 08:12:23   B1
1  2019-03-13 07:10:18   B0
2  2019-03-20 08:12:23  B13
3  2019-03-13 08:12:23  NaN
4  2019-03-15 10:35:53  B10
5  2019-03-20 11:12:23  B12
I want to fill the NaN value with B1: the B value from the same day with the latest time, provided that this latest time is not later than the row's own A value.
So my output should look like this:
                     A    B
0  2019-03-13 08:12:23   B1
1  2019-03-13 07:10:18   B0
2  2019-03-20 08:12:23  B13
3  2019-03-13 08:12:23   B1
4  2019-03-15 10:35:53  B10
5  2019-03-20 11:12:23  B12
I tried to achieve this without success. The best I could do was fill the NaN value with B13 using this:
df['B']=df['B'].replace({'B': {0: np.nan}}).ffill()
Could you please tell me the fastest and most economical way to achieve this?
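Use groupby + ffill: sort the rows by time, forward-fill B within each calendar date, then restore the original row order. Note that this relies on the default integer index, because the positions returned by argsort are used as labels in .loc.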
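as_date = pd.to_datetime(df.A)
# Row positions in chronological order; mergesort is stable, so equal timestamps keep their original order
s = as_date.argsort(kind='mergesort')
# Forward-fill within each calendar date, then go back to the original row order
df['B'] = df.B.loc[s].groupby(as_date.loc[s].dt.date).ffill().loc[df.index]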
                     A    B
0  2019-03-13 08:12:23   B1
1  2019-03-13 07:10:18   B0
2  2019-03-20 08:12:23  B13
3  2019-03-13 08:12:23   B1
4  2019-03-15 10:35:53  B10
5  2019-03-20 11:12:23  B12
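For reference, the same idea can be wrapped in a small helper that does not depend on the index being 0..n-1. This is only a sketch under a few assumptions: the function name `fill_within_day` and its parameters are made up for illustration, the index labels are assumed to be unique, and rows with identical timestamps are assumed to be ordered as they appear in the frame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": ["2019-03-13 08:12:23", "2019-03-13 07:10:18", "2019-03-20 08:12:23",
          "2019-03-13 08:12:23", "2019-03-15 10:35:53", "2019-03-20 11:12:23"],
    "B": ["B1", "B0", "B13", np.nan, "B10", "B12"],
})


def fill_within_day(frame, time_col="A", value_col="B"):
    """Forward-fill value_col in time order without crossing a calendar day.

    Hypothetical helper; assumes the index labels of `frame` are unique.
    """
    stamps = pd.to_datetime(frame[time_col])
    # Index labels in chronological order. mergesort is a stable sort,
    # so rows with equal timestamps keep their original relative order.
    order = stamps.sort_values(kind="mergesort").index
    # Midnight of each timestamp serves as the per-day group key.
    days = stamps.dt.normalize()
    filled = frame[value_col].loc[order].groupby(days.loc[order]).ffill()
    out = frame.copy()
    # Assignment aligns on index labels, which restores the original order.
    out[value_col] = filled
    return out


result = fill_within_day(df)
print(result)
```

A NaN that is the earliest row of its day has nothing before it on that day, so it stays NaN. When a filled row and an empty row share exactly the same timestamp, as rows 0 and 3 do here, the empty row is only filled if the filled one comes first in the original frame.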