I am working with a multi index data frame that has a date column and location_id as indices.
index_1 = ['2020-01-01', '2020-01-03', '2020-01-04']
index_2 = [100,200,300]
index = pd.MultiIndex.from_product([index_1,
index_2], names=['Date', 'location_id'])
df = pd.DataFrame(np.random.randint(10,100,9), index)
df
0
Date location_id
2020-01-01 100 19
200 75
300 39
2020-01-03 100 11
200 91
300 80
2020-01-04 100 36
200 56
300 54
I want to fill in missing dates, with just one location_id and fill it with 0:
0
Date location_id
2020-01-01 100 19
200 75
300 39
2020-01-02 100 0
2020-01-03 100 11
200 91
300 80
2020-01-04 100 36
200 56
300 54
How can I achieve that? This is helpful but only if my data frame was not multi indexed.
To add missing dates to Python Pandas DataFrame, we can use the DatetimeIndex instance's reindex method. We create a date range index with idx = pd. date_range('09-01-2020', '09-30-2020') .
To revert the index of the dataframe from multi-index to a single index using the Pandas inbuilt function reset_index(). Returns: (Data Frame or None) DataFrame with the new index or None if inplace=True.
you can get unique
value of the Date index level, generate all dates between min and max with pd.date_range
and use difference
with unique value of Date to get the missing one. Then reindex
df with the union
of the original index and a MultiIndex.from_product
made of missing date and the min
of the level location_id.
#unique dates
m = df.index.unique(level=0)
# reindex
df = df.reindex(df.index.union(
pd.MultiIndex.from_product([pd.date_range(m.min(), m.max())
.difference(pd.to_datetime(m))
.strftime('%Y-%m-%d'),
[df.index.get_level_values(1).min()]])),
fill_value=0)
print(df)
0
2020-01-01 100 91
200 49
300 19
2020-01-02 100 0
2020-01-03 100 41
200 25
300 51
2020-01-04 100 44
200 40
300 54
instead of pd.MultiIndex.from_product
, you can also use product
from itertools
. Same result but maybe faster.
from itertools import product
df = df.reindex(df.index.union(
list(product(pd.date_range(m.min(), m.max())
.difference(pd.to_datetime(m))
.strftime('%Y-%m-%d'),
[df.index.get_level_values(1).min()]))),
fill_value=0)
Pandas index is immutable, so you need to construct a new index. Put index level location_id
to column and get unique rows and call asfreq
to create rows for missing date. Assign the result to df2
. Finally, use df.align
to join both indices and fillna
df1 = df.reset_index(-1)
df2 = df1.loc[~df1.index.duplicated()].asfreq('D').ffill()
df_final = df.align(df2.set_index('location_id', append=True))[0].fillna(0)
Out[75]:
0
Date location_id
2020-01-01 100 19.0
200 75.0
300 39.0
2020-01-02 100 0.0
2020-01-03 100 11.0
200 91.0
300 80.0
2020-01-04 100 36.0
200 56.0
300 54.0
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With