I have a data frame containing financial data sampled at 1-minute intervals. Occasionally a row or two of data might be missing.
#Example Input---------------------------------------------
open high low close
2019-02-07 16:01:00 124.624 124.627 124.647 124.617
2019-02-07 16:04:00 124.646 124.655 124.664 124.645
# Desired Output--------------------------------------------
open high low close
2019-02-07 16:01:00 124.624 124.627 124.647 124.617
2019-02-07 16:02:00 NaN NaN NaN NaN
2019-02-07 16:03:00 NaN NaN NaN NaN
2019-02-07 16:04:00 124.646 124.655 124.664 124.645
My current method is based on this post - Find missing minute data in time series data using pandas - which advises only how to identify the gaps, not how to fill them.
What I'm doing is creating a DatetimeIndex of 1-minute intervals. Then, using this index, I create an entirely new dataframe, which can then be merged into my original dataframe, thus filling the gaps. The code is shown below. It seems quite a roundabout way of doing this, and I would like to know if there is a better way. Maybe by resampling the data?
import pandas as pd
from datetime import datetime
# Initialise prices dataframe with missing data
prices = pd.DataFrame([[datetime(2019,2,7,16,0), 124.634, 124.624, 124.65, 124.62],
                       [datetime(2019,2,7,16,4), 124.624, 124.627, 124.647, 124.617]])
prices.columns = ['datetime','open','high','low','close']
prices = prices.set_index('datetime')
print(prices)
# Create a new dataframe with complete set of time intervals
idx_ref = pd.date_range(start=datetime(2019,2,7,16,0), end=datetime(2019,2,7,16,4), freq='min')
df = pd.DataFrame(index=idx_ref)
# Merge the two dataframes
prices = pd.merge(df, prices, how='outer', left_index=True,
                  right_index=True)
print(prices)
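For comparison, the outer merge above can likely be collapsed into a single reindex call against the same idx_ref; a minimal sketch:
# reindex inserts the timestamps from idx_ref that are missing in prices
# and fills their rows with NaN, replacing the merge step above
prices = prices.reindex(idx_ref)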
Use DataFrame.asfreq, which works from a DatetimeIndex (starting again from the raw prices frame, before the index is set):
prices = prices.set_index('datetime').asfreq('1Min')
print(prices)
open high low close
datetime
2019-02-07 16:00:00 124.634 124.624 124.650 124.620
2019-02-07 16:01:00 NaN NaN NaN NaN
2019-02-07 16:02:00 NaN NaN NaN NaN
2019-02-07 16:03:00 NaN NaN NaN NaN
2019-02-07 16:04:00 124.624 124.627 124.647 124.617
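On the resampling idea from the question: resample followed by asfreq produces the same result, and an explicit reindex works as well. A minimal sketch, assuming prices already carries the DatetimeIndex:
# Group into 1-minute bins and take the value at each bin edge;
# bins that contain no data become NaN rows
filled = prices.resample('1Min').asfreq()

# Or reindex against an explicitly built complete index
full_idx = pd.date_range(prices.index.min(), prices.index.max(), freq='1Min')
filled = prices.reindex(full_idx)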
A more manual answer would be:
from datetime import timedelta
from dateutil import parser
import pandas as pd

df = pd.DataFrame({
    'a': ['2021-02-07 11:00:30', '2021-02-07 11:00:31', '2021-02-07 11:00:35'],
    'b': [64.8, 64.8, 50.3]
})

# Build the complete list of 1-second timestamps between min and max
max_dt = parser.parse(max(df['a']))
min_dt = parser.parse(min(df['a']))
dt_range = []
while min_dt <= max_dt:
    dt_range.append(min_dt.strftime("%Y-%m-%d %H:%M:%S"))
    min_dt += timedelta(seconds=1)

# Left-merge the original data onto the complete range; gaps become NaN
complete_df = pd.DataFrame({'a': dt_range})
final_df = complete_df.merge(df, how='left', on='a')
It converts the following dataframe:
a b
0 2021-02-07 11:00:30 64.8
1 2021-02-07 11:00:31 64.8
2 2021-02-07 11:00:35 50.3
to:
a b
0 2021-02-07 11:00:30 64.8
1 2021-02-07 11:00:31 64.8
2 2021-02-07 11:00:32 NaN
3 2021-02-07 11:00:33 NaN
4 2021-02-07 11:00:34 NaN
5 2021-02-07 11:00:35 50.3
whose null values we can then fill.
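For instance, the remaining NaNs could be forward-filled or interpolated; a minimal sketch on the final_df from above:
# Carry the last observed value of column 'b' forward into the gap
final_df['b'] = final_df['b'].ffill()

# ...or interpolate linearly between the surrounding observations instead
# final_df['b'] = final_df['b'].interpolate()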