
Find gaps in pandas time series dataframe sampled at 1 minute intervals and fill the gaps with new rows

Problem

I have a data frame containing financial data sampled at 1 minute intervals. Occasionally a row or two of data might be missing.

  • I'm looking for a good (simple and efficient) way to insert new rows into the dataframe at the points where data is missing.
  • The new rows can be empty except for the index, which contains the timestamp.

For example:

 #Example Input---------------------------------------------
                      open     high     low      close
 2019-02-07 16:01:00  124.624  124.627  124.647  124.617  
 2019-02-07 16:04:00  124.646  124.655  124.664  124.645  

 # Desired Output-------------------------------------------
                      open     high     low      close
 2019-02-07 16:01:00  124.624  124.627  124.647  124.617  
 2019-02-07 16:02:00  NaN      NaN      NaN      NaN
 2019-02-07 16:03:00  NaN      NaN      NaN      NaN
 2019-02-07 16:04:00  124.646  124.655  124.664  124.645 

My current method is based on this post - Find missing minute data in time series data using pandas - which only advises how to identify the gaps, not how to fill them.

What I'm doing is creating a DatetimeIndex at 1-minute intervals. Using this index, I create an entirely new dataframe, which I then merge with my original dataframe, thus filling the gaps. The code is shown below. It seems quite a roundabout way of doing this, and I would like to know if there is a better way. Maybe by resampling the data?

import pandas as pd
from datetime import datetime

# Initialise prices dataframe with missing data
prices = pd.DataFrame([[datetime(2019,2,7,16,0),  124.634,  124.624, 124.65,   124.62],[datetime(2019,2,7,16,4), 124.624,  124.627,  124.647,  124.617]])
prices.columns = ['datetime','open','high','low','close']
prices = prices.set_index('datetime')
print(prices)

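# Create a new dataframe with complete set of time intervals
# (pd.date_range is used here because current pandas no longer accepts
# start/end arguments in the pd.DatetimeIndex constructor)
idx_ref = pd.date_range(start=datetime(2019,2,7,16,0), end=datetime(2019,2,7,16,4), freq='min')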
df = pd.DataFrame(index=idx_ref)

# Merge the two dataframes
prices = pd.merge(df, prices, how='outer', left_index=True,
                  right_index=True)
print(prices)
Arran Duff asked Feb 08 '19 12:02




2 Answers

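Use DataFrame.asfreq, which works on a DatetimeIndex:

# Assumes 'datetime' is still a column; if it is already the index,
# call prices.asfreq('1min') directly
prices = prices.set_index('datetime').asfreq('1min')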
print(prices)
                        open     high      low    close
datetime                                               
2019-02-07 16:00:00  124.634  124.624  124.650  124.620
2019-02-07 16:01:00      NaN      NaN      NaN      NaN
2019-02-07 16:02:00      NaN      NaN      NaN      NaN
2019-02-07 16:03:00      NaN      NaN      NaN      NaN
2019-02-07 16:04:00  124.624  124.627  124.647  124.617
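
For comparison, here is a minimal sketch of two other ways to get the same gap-filled frame: reindexing onto an explicit `pd.date_range`, and `resample(...).asfreq()`, which the question hinted at. It rebuilds the question's sample data so it runs on its own; the names `full_index`, `missing`, `filled` and `resampled` are illustrative only, and the sketch assumes the index is unique.

```python
import pandas as pd

# Same sample data as in the question, already indexed by timestamp.
prices = pd.DataFrame(
    {
        "open": [124.634, 124.624],
        "high": [124.624, 124.627],
        "low": [124.650, 124.647],
        "close": [124.620, 124.617],
    },
    index=pd.DatetimeIndex(
        ["2019-02-07 16:00:00", "2019-02-07 16:04:00"], name="datetime"
    ),
)

# Complete 1-minute grid between the first and last timestamp.
full_index = pd.date_range(
    prices.index.min(), prices.index.max(), freq="min", name="datetime"
)

# Timestamps that are absent from the original data (the gaps).
missing = full_index.difference(prices.index)

# Option 1: reindex onto the full grid; absent timestamps become all-NaN rows.
filled = prices.reindex(full_index)

# Option 2: resample to 1-minute bins and take the values as-is (no aggregation).
resampled = prices.resample("min").asfreq()

print(missing)
print(filled)
```

The explicit `reindex` route is mostly useful when you also want the list of missing timestamps, or when the grid should extend beyond the first and last observed rows; otherwise `asfreq` is the shorter spelling.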
jezrael answered Sep 29 '22 09:09


A more manual answer would be:

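from datetime import timedelta
from dateutil import parser

import pandas as pd

df = pd.DataFrame({
    'a': ['2021-02-07 11:00:30', '2021-02-07 11:00:31', '2021-02-07 11:00:35'],
    'b': [64.8, 64.8, 50.3]
})

# ISO-formatted strings sort chronologically, so max/min on the raw strings works here
max_dt = parser.parse(max(df['a']))
min_dt = parser.parse(min(df['a']))

# Build every timestamp from min_dt to max_dt in 1-second steps,
# formatted exactly like column 'a' so the merge keys match
dt_range = []
while min_dt <= max_dt:
    dt_range.append(min_dt.strftime("%Y-%m-%d %H:%M:%S"))
    min_dt += timedelta(seconds=1)

# Left-merge onto the complete range; missing timestamps get NaN in 'b'
complete_df = pd.DataFrame({'a': dt_range})
final_df = complete_df.merge(df, how='left', on='a')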

It converts the following dataframe:

                     a     b
0  2021-02-07 11:00:30  64.8
1  2021-02-07 11:00:31  64.8
2  2021-02-07 11:00:35  50.3

to:

                     a     b
0  2021-02-07 11:00:30  64.8
1  2021-02-07 11:00:31  64.8
2  2021-02-07 11:00:32   NaN
3  2021-02-07 11:00:33   NaN
4  2021-02-07 11:00:34   NaN
5  2021-02-07 11:00:35  50.3

whose null values we can fill later.
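
As a rough sketch of that later filling step, the block below rebuilds the gap-filled frame shown above and applies two common pandas options: forward fill and linear interpolation. The names `forward` and `linear` are illustrative, and which strategy is appropriate (or whether the NaN rows should be left alone) depends on what the data represents.

```python
import pandas as pd

# Gap-filled frame from the example above, rebuilt so this block runs on its own.
timestamps = pd.date_range(
    "2021-02-07 11:00:30", "2021-02-07 11:00:35", freq="s"
).strftime("%Y-%m-%d %H:%M:%S")
final_df = pd.DataFrame({
    "a": timestamps,
    "b": [64.8, 64.8, float("nan"), float("nan"), float("nan"), 50.3],
})

# Forward fill: carry the last observed value into each gap.
forward = final_df["b"].ffill()

# Linear interpolation: draw a straight line between the neighbouring observations
# (rows are evenly spaced here, so position-based interpolation is adequate).
linear = final_df["b"].interpolate()

print(final_df.assign(b_ffill=forward, b_linear=linear))
```

Forward fill never invents values that were not observed, while interpolation produces smoother but synthetic values.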

NaWeeD answered Sep 29 '22 07:09