
Correct entries in date column based on time column for a timeseries dataframe

I have a timeseries dataframe with three columns (date, time and value) that looks like this:

**date**              **time**            **value**
11.03.2020            1103                   5  
11.03.2020            0000                   10
11.03.2020            0100                   6
12.03.2020            0201                   8
12.03.2020            0305                   7
12.03.2020            0400                   4
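
To make the example reproducible, the sample above could be built roughly like this. This is only a sketch: it assumes the time column holds strings (so the leading zeros in values like 0000 survive) and the date column holds day-first strings.

```python
import pandas as pd

# Sample frame from the question. `time` is kept as strings on purpose,
# so that "0000" and "0100" do not collapse to the integers 0 and 100.
df = pd.DataFrame({
    "date": ["11.03.2020", "11.03.2020", "11.03.2020",
             "12.03.2020", "12.03.2020", "12.03.2020"],
    "time": ["1103", "0000", "0100", "0201", "0305", "0400"],
    "value": [5, 10, 6, 8, 7, 4],
})
```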

Basically, the time column increments by 60 (±5) minutes on every row. I want to correct the date column so that whenever the time is 0000 (±5), the day part of the date increments by 1 and keeps that value until the next 0000 (±5) time value is encountered, where it increments by 1 again, and so on until the next such time value or the end of the dataframe is reached.

The result should look like this:

**date**              **time**            **value**
11.03.2020            1103                   5  
12.03.2020            0000                   10
12.03.2020            0100                   6
12.03.2020            0201                   8
12.03.2020            0305                   7
12.03.2020            0400                   4

I would appreciate some help. Thanks

Azee. asked Mar 01 '23 15:03


2 Answers

Parse the strings in the date column as datetimes:

import pandas as pd

df['date'] = pd.to_datetime(df['date'], dayfirst=True)

Create a boolean mask m by comparing the time column with 0000 (this is an exact match and assumes the time column holds strings). Using boolean indexing, add a DateOffset of 1 day to the date values where the mask is true. Then mask the updated date column wherever the current date is less than the previous date and forward fill those values:

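# rows where the time is exactly midnight
m = df['time'].eq('0000')
# push those rows to the next day
df.loc[m, 'date'] += pd.DateOffset(days=1)
# blank out dates that go backwards, then carry the last valid date forward
df['date'] = df['date'].mask(df['date'].diff().dt.days.lt(0)).ffill()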

        date  time  value
0 2020-03-11  1103      5
1 2020-03-12  0000     10
2 2020-03-12  0100      6
3 2020-03-12  0201      8
4 2020-03-12  0305      7
5 2020-03-12  0400      4
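
The mask above matches only an exact 0000, while the question asks for a ±5 minute tolerance. A possible variation, sketched here under the assumption that time holds HHMM strings, converts the times to minutes since midnight and widens the mask. The names t and minutes and the small sample frame are made up for this illustration; the rest follows the same steps as above.

```python
import pandas as pd

# Hypothetical sample in which the rollover row reads 2358 instead of 0000.
df = pd.DataFrame({
    "date": ["11.03.2020", "11.03.2020", "11.03.2020", "12.03.2020"],
    "time": ["1103", "2358", "0100", "0201"],
    "value": [5, 10, 6, 8],
})
df["date"] = pd.to_datetime(df["date"], dayfirst=True)

# Minutes since midnight, computed from the HHMM strings.
t = df["time"].astype(str).str.zfill(4)
minutes = t.str[:2].astype(int) * 60 + t.str[2:].astype(int)

# True for 2355..2359 and 0000..0005, i.e. within 5 minutes of midnight.
m = minutes.le(5) | minutes.ge(24 * 60 - 5)

# Same steps as in the answer: bump the midnight rows by one day, then
# blank out dates that go backwards and carry the last valid date forward.
df.loc[m, "date"] += pd.DateOffset(days=1)
df["date"] = df["date"].mask(df["date"].diff().dt.days.lt(0)).ffill()
```

Like the original, this only repairs rows whose date falls behind the row before it, so it relies on the existing date column being roughly right to begin with.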

Shubham Sharma answered Mar 05 '23 18:03


Note: This answer works with the sample data but will not handle end-of-month transitions correctly. Some parts may be useful for reference, but use Shubham's answer for a proper implementation.


Assuming the first day is correct, find the rows close to midnight, take their cumsum() and add the first day:

import numpy as np
import pandas as pd

df.date = pd.to_datetime(df.date, dayfirst=True)
midnight = [f'{t:04d}' for t in np.r_[2355:2360, 0:6]]
days = (df.time.isin(midnight).cumsum() # find rows +-5 of midnight and cumsum
          .add(df.date[0].day)) # add the first day to the cumsum series

# 0    11
# 1    12
# 2    12
# 3    12
# 4    12
# 5    12
# Name: time, dtype: int64

Then reconstruct date using these fixed days:

df.date = pd.to_datetime(
    df.date.dt.year.astype(str)
    + df.date.dt.month.astype(str).str.zfill(2)
    + days.astype(str).str.zfill(2))

#         date  time  value
# 0 2020-03-11  1103      5
# 1 2020-03-12  0000     10
# 2 2020-03-12  0100      6
# 3 2020-03-12  0201      8
# 4 2020-03-12  0305      7
# 5 2020-03-12  0400      4
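
One way the end-of-month limitation might be avoided is to add the cumulative count to the first date as a timedelta instead of rebuilding the date from its day number. This is only a sketch: it assumes the first date is correct, that time holds HHMM strings, and that each day has exactly one row within ±5 minutes of midnight. The name rollovers and the sample frame are made up for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample that crosses a month boundary; every date is "wrong"
# except the first one.
df = pd.DataFrame({
    "date": ["31.03.2020"] * 4,
    "time": ["2200", "2301", "0002", "0100"],
    "value": [5, 10, 6, 8],
})
df["date"] = pd.to_datetime(df["date"], dayfirst=True)

# HHMM strings within 5 minutes of midnight: 2355..2359 and 0000..0005.
midnight = [f"{int(t):04d}" for t in np.r_[2355:2360, 0:6]]

# Number of midnight crossings seen up to and including each row.
rollovers = df["time"].isin(midnight).cumsum()

# First date plus that many whole days; calendar arithmetic takes care of
# month and year boundaries.
df["date"] = df["date"].iloc[0] + pd.to_timedelta(rollovers, unit="D")
```

With this sample the last two rows land on 2020-04-01 rather than an invalid day 32 of March.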

tdy answered Mar 05 '23 16:03