Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fill Gaps in time series pandas dataframe

I have a pandas dataframe with gaps in time series.
It looks like the following:

Example Input

--------------------------------------
     Timestamp        Close
 2021-02-07 09:30:00  124.624 
 2021-02-07 09:31:00  124.617
 2021-02-07 10:04:00  123.946
 2021-02-07 16:00:00  123.300
 2021-02-09 09:04:00  125.746
 2021-02-09 09:05:00  125.646
 2021-02-09 15:58:00  125.235
 2021-02-09 15:59:00  126.987
 2021-02-09 16:00:00  127.124 

Desired Output

--------------------------------------------
     Timestamp        Close
 2021-02-07 09:30:00  124.624 
 2021-02-07 09:31:00  124.617
 2021-02-07 09:32:00  124.617
 2021-02-07 09:33:00  124.617
   'Insert a line for each minute up to the next available
   timestamp with the Close value form the last available timestamp'
 2021-02-07 10:03:00  124.617 
 2021-02-07 10:04:00  123.946
 2021-02-07 16:00:00  123.300
   'I dont want lines inserted here. As this date is not
   present in the original dataset (could be a non trading
   day so I dont want to fill this gap)'
 2021-02-09 09:04:00  125.746
 2021-02-09 09:05:00  125.646
 2021-02-09 15:58:00  125.235
   'Fill the gaps here again but only between 09:30 and 16:00 time'
 2021-02-09 15:59:00  126.987
 2021-02-09 16:00:00  127.124 

What I have tried is:

'# set the index column'
df_process.set_index('Exchange DateTime', inplace=True)

'# resample and forward fill the gaps'
df_process_out = df_process.resample(rule='1T').ffill()

'# filter and return only timestamps between 09:30 and 16:00'
df_process_out = df_process_out.between_time(start_time='09:30:00', end_time='16:00:00')

However if I do it like this it also resamples and generates new timestamps on dates that are not existent in the original dataframe. In the example above it would also generate timestamps on a minute basis for 2021-02-08

How can I avoid this?

Furthermore is there a better way to avoid resampling over the whole time.

df_process_out = df_process.resample(rule='1T').ffill()

This generates timestamps from 00:00 to 24:00 and in the next line of code I have to filter most timestamps out again. Doesn't seem efficient.

Any help/guidance would be highly appreciated
Thanks


Edit:
As requested a small sample set

df_in: Input data
df_out_error: Wrong Output Data
df_out_OK: How the output data should look like

In the following ColabNotebook I prepeared a small sample.

https://colab.research.google.com/drive/1Fps2obTv1YPDpTzXTo7ivLI5njoI-y4n?usp=sharing

Notice that this is only a small subset of the data. I'm trying to clean multiple years of data that is structured and shows missing minutes timestamps like this.

like image 982
Chris Bauer Avatar asked Oct 11 '25 10:10

Chris Bauer


1 Answers

You can achieve what you need with a combination of df.groupby() (over dates) and resampling using rule = "1Min". Try this -

df_new = (df.assign(date=df.Timestamp.dt.date)   #create new col 'date' from the timestamp
            .set_index('Timestamp')              #set timestamp as index
            .groupby('date')                     #groupby for each date
            .apply(lambda x: x.resample('1Min')  #apply resampling for 1 minute from start time to end time for that date
                   .ffill())                     #ffill values
            .reset_index('date', drop=True)      #drop index 'date' that was created by groupby
            .drop('date',1)                      #drop 'date' column created before
            .reset_index()                       #reset index to get back original 2 cols
         )

df_new

enter image description here


Explanation

1. Resampling for limited time period only

"Furthermore is there a better way to avoid resampling over the whole time. This generates timestamps from 00:00 to 24:00 and in the next line of code I have to filter most timestamps out again. Doesn't seem efficient."

As in the above solution, you can resample and then ffill (or any other type of fill) using rule = 1Min. This ensures that you are not resampling from 00:00 to 24:00 but only from the start to end time stamps available in your data. To prove, I show this applied to a single date in the data -

#filtering for a single day
ddd = df[df['date']==df.date.unique()[0]]

#applying resampling on that given day
ddd.set_index('Timestamp').resample('1Min').ffill()

enter image description here

Notice the start (09:30:00) and end (16:00:00) timestamps for the given date.

2. Applying resample over existing dates only

"In the example above it would also generate timestamps on a minute basis for 2021-02-08. How can I avoid this?"

As in the above solution, you can apply the resampling method over date groups separately. In this case, I apply the method using a lambda function after separating out the date from the timestamps. So the resample happens only for the date that exist in the dataset

df_new.Timestamp.dt.date.unique()
array([datetime.date(2021, 2, 7), datetime.date(2021, 2, 9)], dtype=object)

Notice, that the output only contains the 2 unique dates from the original dataset.

like image 67
Akshay Sehgal Avatar answered Oct 13 '25 23:10

Akshay Sehgal