I am working with a multi-indexed data frame whose index levels are a date (Date) and location_id.
import numpy as np
import pandas as pd

index_1 = ['2020-01-01', '2020-01-03', '2020-01-04']
index_2 = [100, 200, 300]
index = pd.MultiIndex.from_product([index_1, index_2], names=['Date', 'location_id'])
df = pd.DataFrame(np.random.randint(10, 100, 9), index)
df
                         0
Date       location_id    
2020-01-01 100          19
           200          75
           300          39
2020-01-03 100          11
           200          91
           300          80
2020-01-04 100          36
           200          56
           300          54
I want to fill in the missing dates, adding just one location_id for each missing date and setting its value to 0:
                         0
Date       location_id    
2020-01-01 100          19
           200          75
           300          39
2020-01-02 100          0
2020-01-03 100          11
           200          91
           300          80
2020-01-04 100          36
           200          56
           300          54
How can I achieve that? This is helpful, but only if my data frame were not multi-indexed.
To add missing dates to a pandas DataFrame with a single date index, pass a complete DatetimeIndex to the DataFrame's reindex method. Create the date range with idx = pd.date_range('09-01-2020', '09-30-2020'), then call df.reindex(idx, fill_value=0).
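A minimal sketch of that single-index case, for comparison with the multi-index problem in the question. The frame `single` and its `value` column are made up for illustration; only `pd.date_range` and `reindex` come from the text above.

```python
import pandas as pd

# Hypothetical single-indexed frame with one missing day (2020-01-02).
s_index = pd.to_datetime(['2020-01-01', '2020-01-03', '2020-01-04'])
single = pd.DataFrame({'value': [19, 11, 36]}, index=s_index)

# Full daily range between the first and last date.
idx = pd.date_range(single.index.min(), single.index.max())

# Reindex onto the full range; rows for missing dates are filled with 0.
single_filled = single.reindex(idx, fill_value=0)
print(single_filled)
```

This works because the index has a single level; with a MultiIndex, the new index has to be built from (Date, location_id) pairs instead.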
To convert a DataFrame's index from a multi-index back to a single index, use the pandas built-in method reset_index(). Returns: a DataFrame with the new index, or None if inplace=True.
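As a small sketch of what reset_index() does to a multi-index, assuming a made-up frame `multi` with a `value` column: passing a level name moves only that level into a column, while calling it with no arguments moves every level out.

```python
import pandas as pd

# Hypothetical two-level frame shaped like the one in the question.
index = pd.MultiIndex.from_product(
    [['2020-01-01', '2020-01-03'], [100, 200]],
    names=['Date', 'location_id'])
multi = pd.DataFrame({'value': [19, 75, 11, 91]}, index=index)

# Move only location_id out of the index; Date stays as a single-level index.
by_date = multi.reset_index('location_id')

# Move every level out; the result has a default RangeIndex.
flat = multi.reset_index()
print(by_date)
print(flat)
```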
You can get the unique values of the Date index level, generate all dates between the minimum and maximum with pd.date_range, and use difference against the existing dates to find the missing ones. Then reindex df with the union of the original index and a MultiIndex.from_product built from the missing dates and the minimum of the location_id level.
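# unique dates in the first index level (strings in this example)
m = df.index.unique(level=0)
# dates missing between the first and last date, formatted back to strings
missing = (pd.date_range(m.min(), m.max())
             .difference(pd.to_datetime(m))
             .strftime('%Y-%m-%d'))
# pair each missing date with the smallest location_id
new_rows = pd.MultiIndex.from_product([missing, [df.index.get_level_values(1).min()]])
# reindex on the union of the old and new rows, filling the new rows with 0
df = df.reindex(df.index.union(new_rows), fill_value=0)
print(df)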
                 0
2020-01-01 100  91
           200  49
           300  19
2020-01-02 100   0
2020-01-03 100  41
           200  25
           300  51
2020-01-04 100  44
           200  40
           300  54
Instead of pd.MultiIndex.from_product, you can also use product from itertools. The result is the same, and it may be faster.
from itertools import product
df = df.reindex(df.index.union(
                  list(product(pd.date_range(m.min(), m.max())
                                 .difference(pd.to_datetime(m))
                                 .strftime('%Y-%m-%d'),
                               [df.index.get_level_values(1).min()]))),
                fill_value=0)
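One side effect of both versions above is that the extra rows are built without level names, so the union drops the Date / location_id labels (the printed result has no index header). A sketch of a variant that keeps them by passing `names=df.index.names` to `from_product`; the fixed values from `np.arange` and the names `missing`, `new_rows`, and `filled` are chosen here only to make the example reproducible.

```python
import numpy as np
import pandas as pd

dates = ['2020-01-01', '2020-01-03', '2020-01-04']
locations = [100, 200, 300]
index = pd.MultiIndex.from_product([dates, locations],
                                   names=['Date', 'location_id'])
# Fixed values instead of random ones so the result is reproducible.
df = pd.DataFrame(np.arange(10, 19), index)

# Dates missing between the first and last date, as strings like the originals.
m = df.index.unique(level=0)
missing = (pd.date_range(m.min(), m.max())
             .difference(pd.to_datetime(m))
             .strftime('%Y-%m-%d'))

# Give the new rows the same level names so the union keeps them.
new_rows = pd.MultiIndex.from_product(
    [missing, [df.index.get_level_values(1).min()]],
    names=df.index.names)

filled = df.reindex(df.index.union(new_rows), fill_value=0)
print(filled)
```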
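A pandas index is immutable, so you need to construct a new one. Move the location_id index level to a column, keep the first row for each date, call asfreq('D') to create rows for the missing dates, and forward-fill so that each new row inherits a location_id. Note that asfreq needs the Date level to be a DatetimeIndex, so string dates must be converted with pd.to_datetime first. Assign the result to df2. Finally, use df.align to join both indices and fillna(0) to fill the new rows.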
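# move location_id out of the index, leaving Date as the only level
df1 = df.reset_index(-1)
# keep one row per date, add the missing days, and forward-fill location_id
df2 = df1.loc[~df1.index.duplicated()].asfreq('D').ffill()
# align df to the combined (Date, location_id) index and fill the new rows with 0
df_final = df.align(df2.set_index('location_id', append=True))[0].fillna(0)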
Out[75]:
                           0
Date       location_id
2020-01-01 100          19.0
           200          75.0
           300          39.0
2020-01-02 100           0.0
2020-01-03 100          11.0
           200          91.0
           300          80.0
2020-01-04 100          36.0
           200          56.0
           300          54.0
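The output above is float because the aligned frame briefly holds NaN for the new row. A self-contained sketch of the same approach, assuming the Date level is built as datetimes from the start (asfreq requires that) and casting back to integers at the end; the fixed values and the two `astype(int)` calls are additions for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Date level as real datetimes, since asfreq only works on a DatetimeIndex.
dates = pd.to_datetime(['2020-01-01', '2020-01-03', '2020-01-04'])
index = pd.MultiIndex.from_product([dates, [100, 200, 300]],
                                   names=['Date', 'location_id'])
df = pd.DataFrame(np.arange(10, 19), index)

# One row per date, with the missing days added and location_id forward-filled.
df1 = df.reset_index(-1)
df2 = df1.loc[~df1.index.duplicated()].asfreq('D').ffill()
# ffill leaves location_id as float (it held NaN); cast it back to int.
df2['location_id'] = df2['location_id'].astype(int)

# Outer-align df with the new index, then fill the added row and restore ints.
df_final = (df.align(df2.set_index('location_id', append=True))[0]
              .fillna(0)
              .astype(int))
print(df_final)
```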