I have a dataframe which has aggregated data for some days. I want to add in the missing days
I was following another post, Add missing dates to pandas dataframe, unfortunately, it overwrote my results (maybe functionality was changed slightly?)... the code is below
import random import datetime as dt import numpy as np import pandas as pd def generate_row(year, month, day): while True: date = dt.datetime(year=year, month=month, day=day) data = np.random.random(size=4) yield [date] + list(data) # days I have data for dates = [(2000, 1, 1), (2000, 1, 2), (2000, 2, 4)] generators = [generate_row(*date) for date in dates] # get 5 data points for each data = [next(generator) for generator in generators for _ in range(5)] df = pd.DataFrame(data, columns=['date'] + ['f'+str(i) for i in range(1,5)]) # df groupby_day = df.groupby(pd.PeriodIndex(data=df.date, freq='D')) results = groupby_day.sum() idx = pd.date_range(min(df.date), max(df.date)) results.reindex(idx, fill_value=0)
Results before filling in missing date indices
Results after
To add missing dates to Python Pandas DataFrame, we can use the DatetimeIndex instance's reindex method. We create a date range index with idx = pd. date_range('09-01-2020', '09-30-2020') .
You need to use period_range
rather than date_range
:
In [11]: idx = pd.period_range(min(df.date), max(df.date)) ...: results.reindex(idx, fill_value=0) ...: Out[11]: f1 f2 f3 f4 2000-01-01 2.049157 1.962635 2.756154 2.224751 2000-01-02 2.675899 2.587217 1.540823 1.606150 2000-01-03 0.000000 0.000000 0.000000 0.000000 2000-01-04 0.000000 0.000000 0.000000 0.000000 2000-01-05 0.000000 0.000000 0.000000 0.000000 2000-01-06 0.000000 0.000000 0.000000 0.000000 2000-01-07 0.000000 0.000000 0.000000 0.000000 2000-01-08 0.000000 0.000000 0.000000 0.000000 2000-01-09 0.000000 0.000000 0.000000 0.000000 2000-01-10 0.000000 0.000000 0.000000 0.000000 2000-01-11 0.000000 0.000000 0.000000 0.000000 2000-01-12 0.000000 0.000000 0.000000 0.000000 2000-01-13 0.000000 0.000000 0.000000 0.000000 2000-01-14 0.000000 0.000000 0.000000 0.000000 2000-01-15 0.000000 0.000000 0.000000 0.000000 2000-01-16 0.000000 0.000000 0.000000 0.000000 2000-01-17 0.000000 0.000000 0.000000 0.000000 2000-01-18 0.000000 0.000000 0.000000 0.000000 2000-01-19 0.000000 0.000000 0.000000 0.000000 2000-01-20 0.000000 0.000000 0.000000 0.000000 2000-01-21 0.000000 0.000000 0.000000 0.000000 2000-01-22 0.000000 0.000000 0.000000 0.000000 2000-01-23 0.000000 0.000000 0.000000 0.000000 2000-01-24 0.000000 0.000000 0.000000 0.000000 2000-01-25 0.000000 0.000000 0.000000 0.000000 2000-01-26 0.000000 0.000000 0.000000 0.000000 2000-01-27 0.000000 0.000000 0.000000 0.000000 2000-01-28 0.000000 0.000000 0.000000 0.000000 2000-01-29 0.000000 0.000000 0.000000 0.000000 2000-01-30 0.000000 0.000000 0.000000 0.000000 2000-01-31 0.000000 0.000000 0.000000 0.000000 2000-02-01 0.000000 0.000000 0.000000 0.000000 2000-02-02 0.000000 0.000000 0.000000 0.000000 2000-02-03 0.000000 0.000000 0.000000 0.000000 2000-02-04 1.856158 2.892620 2.986166 2.793448
This is because your groupby uses PeriodIndex, rather than datetime:
df.groupby(pd.PeriodIndex(data=df.date, freq='D'))
You could have instead used a pd.Grouper
:
df.groupby(pd.Grouper(key="date", freq='D'))
which would have give a datetime index.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With