Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

pandas fill missing dates in time series

Tags:

python

pandas

I have a dataframe which has aggregated data for some days. I want to add in the missing days

I was following another post, Add missing dates to pandas dataframe, unfortunately, it overwrote my results (maybe functionality was changed slightly?)... the code is below

import random import datetime as dt import numpy as np import pandas as pd  def generate_row(year, month, day):     while True:         date = dt.datetime(year=year, month=month, day=day)         data = np.random.random(size=4)         yield [date] + list(data)  # days I have data for dates = [(2000, 1, 1), (2000, 1, 2), (2000, 2, 4)] generators = [generate_row(*date) for date in dates]  # get 5 data points for each data = [next(generator) for generator in generators for _ in range(5)]  df = pd.DataFrame(data, columns=['date'] + ['f'+str(i) for i in range(1,5)])  # df groupby_day = df.groupby(pd.PeriodIndex(data=df.date, freq='D')) results = groupby_day.sum()  idx = pd.date_range(min(df.date), max(df.date)) results.reindex(idx, fill_value=0) 

Results before filling in missing date indices
enter image description here

Results after
enter image description here

like image 936
Alter Avatar asked Nov 10 '17 21:11

Alter


People also ask

How do I add a missing date in time series pandas?

To add missing dates to Python Pandas DataFrame, we can use the DatetimeIndex instance's reindex method. We create a date range index with idx = pd. date_range('09-01-2020', '09-30-2020') .


1 Answers

You need to use period_range rather than date_range:

In [11]: idx = pd.period_range(min(df.date), max(df.date))     ...: results.reindex(idx, fill_value=0)     ...: Out[11]:                   f1        f2        f3        f4 2000-01-01  2.049157  1.962635  2.756154  2.224751 2000-01-02  2.675899  2.587217  1.540823  1.606150 2000-01-03  0.000000  0.000000  0.000000  0.000000 2000-01-04  0.000000  0.000000  0.000000  0.000000 2000-01-05  0.000000  0.000000  0.000000  0.000000 2000-01-06  0.000000  0.000000  0.000000  0.000000 2000-01-07  0.000000  0.000000  0.000000  0.000000 2000-01-08  0.000000  0.000000  0.000000  0.000000 2000-01-09  0.000000  0.000000  0.000000  0.000000 2000-01-10  0.000000  0.000000  0.000000  0.000000 2000-01-11  0.000000  0.000000  0.000000  0.000000 2000-01-12  0.000000  0.000000  0.000000  0.000000 2000-01-13  0.000000  0.000000  0.000000  0.000000 2000-01-14  0.000000  0.000000  0.000000  0.000000 2000-01-15  0.000000  0.000000  0.000000  0.000000 2000-01-16  0.000000  0.000000  0.000000  0.000000 2000-01-17  0.000000  0.000000  0.000000  0.000000 2000-01-18  0.000000  0.000000  0.000000  0.000000 2000-01-19  0.000000  0.000000  0.000000  0.000000 2000-01-20  0.000000  0.000000  0.000000  0.000000 2000-01-21  0.000000  0.000000  0.000000  0.000000 2000-01-22  0.000000  0.000000  0.000000  0.000000 2000-01-23  0.000000  0.000000  0.000000  0.000000 2000-01-24  0.000000  0.000000  0.000000  0.000000 2000-01-25  0.000000  0.000000  0.000000  0.000000 2000-01-26  0.000000  0.000000  0.000000  0.000000 2000-01-27  0.000000  0.000000  0.000000  0.000000 2000-01-28  0.000000  0.000000  0.000000  0.000000 2000-01-29  0.000000  0.000000  0.000000  0.000000 2000-01-30  0.000000  0.000000  0.000000  0.000000 2000-01-31  0.000000  0.000000  0.000000  0.000000 2000-02-01  0.000000  0.000000  0.000000  0.000000 2000-02-02  0.000000  0.000000  0.000000  0.000000 2000-02-03  0.000000  0.000000  0.000000  0.000000 2000-02-04  1.856158  2.892620  2.986166  2.793448 

This is because your groupby uses PeriodIndex, rather than datetime:

df.groupby(pd.PeriodIndex(data=df.date, freq='D')) 

You could have instead used a pd.Grouper:

df.groupby(pd.Grouper(key="date", freq='D')) 

which would have give a datetime index.

like image 147
Andy Hayden Avatar answered Sep 20 '22 13:09

Andy Hayden