I would like to modify a pandas MultiIndex DataFrame so that each index group includes every date in a specified range. For each group, I would like to fill in the missing dates between 2013-06-11 and 2013-12-31 with the value 0 (or NaN).
Group A  Group B  Date        Value
loc_a    group_a  2013-06-11     22
                  2013-07-02     35
                  2013-07-09     14
                  2013-07-30      9
                  2013-08-06      4
                  2013-09-03     40
                  2013-10-01     18
         group_b  2013-07-09      4
                  2013-08-06      2
                  2013-09-03      5
         group_c  2013-07-09      1
                  2013-09-03      2
loc_b    group_a  2013-10-01      3
I've seen a few discussions of reindexing, but those cover simple (non-grouped) time-series data.
Is there an easy way to do this?
Following are some attempts I've made at accomplishing this. For example: once I've unstacked by ['A', 'B'], I can then reindex.
    import datetime as dt
    import pandas as pd

    df = pd.DataFrame({'A': ['loc_a'] * 12 + ['loc_b'],
                       'B': ['group_a'] * 7 + ['group_b'] * 3 + ['group_c'] * 2 + ['group_a'],
                       'Date': ["2013-06-11", "2013-07-02", "2013-07-09", "2013-07-30",
                                "2013-08-06", "2013-09-03", "2013-10-01", "2013-07-09",
                                "2013-08-06", "2013-09-03", "2013-07-09", "2013-09-03",
                                "2013-10-01"],
                       'Value': [22, 35, 14, 9, 4, 40, 18, 4, 2, 5, 1, 2, 3]})
    df.Date = df['Date'].apply(lambda x: pd.to_datetime(x).date())
    df = df.set_index(['A', 'B', 'Date'])

    dt_start = dt.datetime(2013, 6, 1)
    all_dates = [(dt_start + dt.timedelta(days=x)).date() for x in range(0, 60)]

    df2 = df.unstack(['A', 'B'])
    df3 = df2.reindex(index=all_dates).fillna(0)
    df4 = df3.stack(['A', 'B'])

    ## df4 is about where I want to get, now I'm trying to get it back in the form of df...
    df5 = df4.reset_index()
    df6 = df5.rename(columns={'level_0': 'Date'})
    df7 = df6.groupby(['A', 'B', 'Date'])['Value'].sum()
The last few lines make me a little sad. I was hoping that at df6 I could simply set_index back to ['A', 'B', 'Date'], but that did not group the values as they are grouped in the initial df DataFrame.
Any thoughts on how I can reindex the unstacked DataFrame, restack, and have the DataFrame in the same format as the original?
To add missing dates to a pandas DataFrame, we can reindex it against a complete DatetimeIndex using the reindex method. We create a date range index with idx = pd.date_range('09-01-2020', '09-30-2020').
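As a minimal sketch of that idea on a flat (non-grouped) series: the toy series `s` and its values below are made up for illustration, and only the date range comes from the text above.

```python
import pandas as pd

# Hypothetical toy series with gaps; the values are made up.
s = pd.Series(
    [5, 7, 3],
    index=pd.to_datetime(["2020-09-01", "2020-09-04", "2020-09-30"]),
)

# Every calendar day in the target range.
idx = pd.date_range("2020-09-01", "2020-09-30")

# Days absent from `s` are added and given the fill value.
filled = s.reindex(idx, fill_value=0)
```

Leaving out `fill_value` would put NaN in the new rows instead of 0.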
You can slice a MultiIndex by providing multiple indexers. You can provide any of the selectors as if you are indexing by label, see Selection by Label, including slices, lists of labels, labels, and boolean indexers. You can use slice(None) to select all the contents of that level.
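A short sketch of that kind of slicing, using the sample data from the question. The variable names `group_a` and `july` are only illustrative, and the index is sorted first because label slicing on a MultiIndex generally expects a sorted index.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["loc_a"] * 12 + ["loc_b"],
    "B": ["group_a"] * 7 + ["group_b"] * 3 + ["group_c"] * 2 + ["group_a"],
    "Date": ["2013-06-11", "2013-07-02", "2013-07-09", "2013-07-30",
             "2013-08-06", "2013-09-03", "2013-10-01", "2013-07-09",
             "2013-08-06", "2013-09-03", "2013-07-09", "2013-09-03",
             "2013-10-01"],
    "Value": [22, 35, 14, 9, 4, 40, 18, 4, 2, 5, 1, 2, 3],
})
df["Date"] = pd.to_datetime(df["Date"])
df = df.set_index(["A", "B", "Date"]).sort_index()

# slice(None) selects everything on a level:
# all locations, only group_a, all dates.
group_a = df.loc[(slice(None), "group_a", slice(None)), :]

# pd.IndexSlice gives the same kind of selection with slice syntax:
# loc_a, every group, dates within July 2013.
idx = pd.IndexSlice
july = df.loc[
    idx["loc_a", :, pd.Timestamp("2013-07-01"):pd.Timestamp("2013-07-31")], :
]
```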
You can make a new multi index based on the Cartesian product of the levels of the existing multi index. Then, re-index your data frame using the new index.
    new_index = pd.MultiIndex.from_product(df.index.levels)
    new_df = df.reindex(new_index)

    # Optional: convert missing values to zero, and convert the data back
    # to integers. See explanation below.
    new_df = new_df.fillna(0).astype(int)
That's it! The new data frame has all the possible index values. The existing data is indexed correctly.
Read on for a more detailed explanation.
    import pandas as pd

    df = pd.DataFrame({'A': ['loc_a'] * 12 + ['loc_b'],
                       'B': ['group_a'] * 7 + ['group_b'] * 3 + ['group_c'] * 2 + ['group_a'],
                       'Date': ["2013-06-11", "2013-07-02", "2013-07-09", "2013-07-30",
                                "2013-08-06", "2013-09-03", "2013-10-01", "2013-07-09",
                                "2013-08-06", "2013-09-03", "2013-07-09", "2013-09-03",
                                "2013-10-01"],
                       'Value': [22, 35, 14, 9, 4, 40, 18, 4, 2, 5, 1, 2, 3]})
    df.Date = pd.to_datetime(df.Date)
    df = df.set_index(['A', 'B', 'Date'])
Here's what the sample data looks like
                              Value
    A     B       Date
    loc_a group_a 2013-06-11     22
                  2013-07-02     35
                  2013-07-09     14
                  2013-07-30      9
                  2013-08-06      4
                  2013-09-03     40
                  2013-10-01     18
          group_b 2013-07-09      4
                  2013-08-06      2
                  2013-09-03      5
          group_c 2013-07-09      1
                  2013-09-03      2
    loc_b group_a 2013-10-01      3
Using from_product we can make a new multi index. This new index is the Cartesian product of all the values from all the levels of the old index.
new_index = pd.MultiIndex.from_product(df.index.levels)
Use the new index to reindex the existing data frame.
new_df = df.reindex(new_index)
All the possible combinations are now present. The missing values are null (NaN).
The expanded, re-indexed data frame looks like this:
                              Value
    loc_a group_a 2013-06-11   22.0
                  2013-07-02   35.0
                  2013-07-09   14.0
                  2013-07-30    9.0
                  2013-08-06    4.0
                  2013-09-03   40.0
                  2013-10-01   18.0
          group_b 2013-06-11    NaN
                  2013-07-02    NaN
                  2013-07-09    4.0
                  2013-07-30    NaN
                  2013-08-06    2.0
                  2013-09-03    5.0
                  2013-10-01    NaN
          group_c 2013-06-11    NaN
                  2013-07-02    NaN
                  2013-07-09    1.0
                  2013-07-30    NaN
                  2013-08-06    NaN
                  2013-09-03    2.0
                  2013-10-01    NaN
    loc_b group_a 2013-06-11    NaN
                  2013-07-02    NaN
                  2013-07-09    NaN
                  2013-07-30    NaN
                  2013-08-06    NaN
                  2013-09-03    NaN
                  2013-10-01    3.0
          group_b 2013-06-11    NaN
                  2013-07-02    NaN
                  2013-07-09    NaN
                  2013-07-30    NaN
                  2013-08-06    NaN
                  2013-09-03    NaN
                  2013-10-01    NaN
          group_c 2013-06-11    NaN
                  2013-07-02    NaN
                  2013-07-09    NaN
                  2013-07-30    NaN
                  2013-08-06    NaN
                  2013-09-03    NaN
                  2013-10-01    NaN
You can see that the data in the new data frame has been converted from ints to floats. A NumPy-backed integer column can't hold NaN, so pandas upcasts the column to float. Optionally, we can convert all the nulls to 0 and cast the data back to integers.
new_df = new_df.fillna(0).astype(int)
Result
                              Value
    loc_a group_a 2013-06-11     22
                  2013-07-02     35
                  2013-07-09     14
                  2013-07-30      9
                  2013-08-06      4
                  2013-09-03     40
                  2013-10-01     18
          group_b 2013-06-11      0
                  2013-07-02      0
                  2013-07-09      4
                  2013-07-30      0
                  2013-08-06      2
                  2013-09-03      5
                  2013-10-01      0
          group_c 2013-06-11      0
                  2013-07-02      0
                  2013-07-09      1
                  2013-07-30      0
                  2013-08-06      0
                  2013-09-03      2
                  2013-10-01      0
    loc_b group_a 2013-06-11      0
                  2013-07-02      0
                  2013-07-09      0
                  2013-07-30      0
                  2013-08-06      0
                  2013-09-03      0
                  2013-10-01      3
          group_b 2013-06-11      0
                  2013-07-02      0
                  2013-07-09      0
                  2013-07-30      0
                  2013-08-06      0
                  2013-09-03      0
                  2013-10-01      0
          group_c 2013-06-11      0
                  2013-07-02      0
                  2013-07-09      0
                  2013-07-30      0
                  2013-08-06      0
                  2013-09-03      0
                  2013-10-01      0
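Note that the Cartesian product only covers dates that already appear somewhere in the index, and it also creates (A, B) pairs that were never observed, such as loc_b / group_b. If the goal is the full 2013-06-11 to 2013-12-31 range from the question, for the existing groups only, one possible variation is sketched below. The names `all_dates`, `pairs`, `full_index`, and `filled` are illustrative, and this is an assumption about what the asker wants rather than part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["loc_a"] * 12 + ["loc_b"],
    "B": ["group_a"] * 7 + ["group_b"] * 3 + ["group_c"] * 2 + ["group_a"],
    "Date": ["2013-06-11", "2013-07-02", "2013-07-09", "2013-07-30",
             "2013-08-06", "2013-09-03", "2013-10-01", "2013-07-09",
             "2013-08-06", "2013-09-03", "2013-07-09", "2013-09-03",
             "2013-10-01"],
    "Value": [22, 35, 14, 9, 4, 40, 18, 4, 2, 5, 1, 2, 3],
})
df["Date"] = pd.to_datetime(df["Date"])
df = df.set_index(["A", "B", "Date"])

# Every calendar day in the range given in the question.
all_dates = pd.date_range("2013-06-11", "2013-12-31")

# Only the (A, B) pairs that actually occur in the data.
pairs = df.index.droplevel("Date").unique()

# Pair each observed (A, B) with every date in the range.
full_index = pd.MultiIndex.from_tuples(
    [(a, b, d) for a, b in pairs for d in all_dates],
    names=["A", "B", "Date"],
)

# fill_value=0 fills the new rows directly, so the column stays integer.
filled = df.reindex(full_index, fill_value=0)
```

Dropping `fill_value=0` would leave NaN in the new rows, at the cost of the column becoming float.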
Your question wasn't clear about exactly which dates you were missing; I'm just assuming that you want to fill NaN for any date for which you do have an observation elsewhere. My solution will have to be amended if this assumption is faulty.
Side note: it may be nice to include a line to create the DataFrame.

    In [55]: df = pd.DataFrame({'A': ['loc_a'] * 12 + ['loc_b'],
       ....:                    'B': ['group_a'] * 7 + ['group_b'] * 3 + ['group_c'] * 2 + ['group_a'],
       ....:                    'Date': ["2013-06-11",
       ....:                             "2013-07-02",
       ....:                             "2013-07-09",
       ....:                             "2013-07-30",
       ....:                             "2013-08-06",
       ....:                             "2013-09-03",
       ....:                             "2013-10-01",
       ....:                             "2013-07-09",
       ....:                             "2013-08-06",
       ....:                             "2013-09-03",
       ....:                             "2013-07-09",
       ....:                             "2013-09-03",
       ....:                             "2013-10-01"],
       ....:                    'Value': [22, 35, 14, 9, 4, 40, 18, 4, 2, 5, 1, 2, 3]})

    In [56]: df.Date = pd.to_datetime(df.Date)

    In [57]: df = df.set_index(['A', 'B', 'Date'])

    In [58]: print(df)
                              Value
    A     B       Date
    loc_a group_a 2013-06-11     22
                  2013-07-02     35
                  2013-07-09     14
                  2013-07-30      9
                  2013-08-06      4
                  2013-09-03     40
                  2013-10-01     18
          group_b 2013-07-09      4
                  2013-08-06      2
                  2013-09-03      5
          group_c 2013-07-09      1
                  2013-09-03      2
    loc_b group_a 2013-10-01      3
To get the unobserved values filled, we'll use the unstack and stack methods. Unstacking will create the NaNs we're interested in, and then we'll stack them up to work with.
    In [71]: df.unstack(['A', 'B'])
    Out[71]:
                  Value
    A             loc_a                   loc_b
    B           group_a group_b group_c group_a
    Date
    2013-06-11       22     NaN     NaN     NaN
    2013-07-02       35     NaN     NaN     NaN
    2013-07-09       14       4       1     NaN
    2013-07-30        9     NaN     NaN     NaN
    2013-08-06        4       2     NaN     NaN
    2013-09-03       40       5       2     NaN
    2013-10-01       18     NaN     NaN       3

    In [59]: df.unstack(['A', 'B']).fillna(0).stack(['A', 'B'])
    Out[59]:
                              Value
    Date       A     B
    2013-06-11 loc_a group_a     22
                     group_b      0
                     group_c      0
               loc_b group_a      0
    2013-07-02 loc_a group_a     35
                     group_b      0
                     group_c      0
               loc_b group_a      0
    2013-07-09 loc_a group_a     14
                     group_b      4
                     group_c      1
               loc_b group_a      0
    2013-07-30 loc_a group_a      9
                     group_b      0
                     group_c      0
               loc_b group_a      0
    2013-08-06 loc_a group_a      4
                     group_b      2
                     group_c      0
               loc_b group_a      0
    2013-09-03 loc_a group_a     40
                     group_b      5
                     group_c      2
               loc_b group_a      0
    2013-10-01 loc_a group_a     18
                     group_b      0
                     group_c      0
               loc_b group_a      3
Reorder the index levels as necessary.
I had to slip that fillna(0) in the middle there so that the NaNs weren't dropped. stack does have a dropna argument. I would think that setting that to False would keep the all-NaN rows around. A bug maybe?
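To get back to the original (A, B, Date) layout after the round trip, one option is to reorder the index levels by name and sort. This is a sketch under the same assumption as above (only dates observed somewhere are filled). The name `result` is illustrative, and the cast to int is there because the unstacked frame may be float on some pandas versions.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["loc_a"] * 12 + ["loc_b"],
    "B": ["group_a"] * 7 + ["group_b"] * 3 + ["group_c"] * 2 + ["group_a"],
    "Date": ["2013-06-11", "2013-07-02", "2013-07-09", "2013-07-30",
             "2013-08-06", "2013-09-03", "2013-10-01", "2013-07-09",
             "2013-08-06", "2013-09-03", "2013-07-09", "2013-09-03",
             "2013-10-01"],
    "Value": [22, 35, 14, 9, 4, 40, 18, 4, 2, 5, 1, 2, 3],
})
df["Date"] = pd.to_datetime(df["Date"])
df = df.set_index(["A", "B", "Date"])

result = (
    df.unstack(["A", "B"])                 # dates as rows, (A, B) pairs as columns
      .fillna(0)                           # fill the gaps created by unstacking
      .stack(["A", "B"])                   # back to long form: (Date, A, B)
      .astype(int)                         # restore the integer dtype
      .reorder_levels(["A", "B", "Date"])  # same level order as the original df
      .sort_index()                        # group rows by A, then B, then Date
)
```

Because the levels are reordered by name and then sorted, no reset_index, rename, or groupby step is needed to recover the original shape.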