Filling in date gaps in MultiIndex Pandas Dataframe

I would like to modify a pandas MultiIndex DataFrame such that each index group includes Dates between a specified range. I would like each group to fill in missing dates 2013-06-11 to 2013-12-31 with the value 0 (or NaN).

Group A, Group B, Date,           Value
loc_a    group_a  2013-06-11      22
                  2013-07-02      35
                  2013-07-09      14
                  2013-07-30       9
                  2013-08-06       4
                  2013-09-03      40
                  2013-10-01      18
         group_b  2013-07-09       4
                  2013-08-06       2
                  2013-09-03       5
         group_c  2013-07-09       1
                  2013-09-03       2
loc_b    group_a  2013-10-01       3

I've seen a few discussions of reindexing, but those cover simple (non-grouped) time-series data.

Is there an easy way to do this?

Here is one attempt I've made at accomplishing this. Once I've unstacked by ['A', 'B'], I can reindex:

import datetime as dt
import pandas as pd

df = pd.DataFrame({'A': ['loc_a'] * 12 + ['loc_b'],
                   'B': ['group_a'] * 7 + ['group_b'] * 3 + ['group_c'] * 2 + ['group_a'],
                   'Date': ["2013-06-11",
                            "2013-07-02",
                            "2013-07-09",
                            "2013-07-30",
                            "2013-08-06",
                            "2013-09-03",
                            "2013-10-01",
                            "2013-07-09",
                            "2013-08-06",
                            "2013-09-03",
                            "2013-07-09",
                            "2013-09-03",
                            "2013-10-01"],
                   'Value': [22, 35, 14,  9,  4, 40, 18, 4, 2, 5, 1, 2, 3]})

df.Date = df['Date'].apply(lambda x: pd.to_datetime(x).date())
df = df.set_index(['A', 'B', 'Date'])

dt_start = dt.datetime(2013,6,1)
all_dates = [(dt_start + dt.timedelta(days=x)).date() for x in range(0,60)]

df2 = df.unstack(['A', 'B'])
df3 = df2.reindex(index=all_dates).fillna(0)
df4 = df3.stack(['A', 'B'])

## df4 is about where I want to get, now I'm trying to get it back in the form of df...

df5 = df4.reset_index()
df6 = df5.rename(columns={'level_0' : 'Date'})
df7 = df6.groupby(['A', 'B', 'Date'])['Value'].sum()

The last few lines make me a little sad. I was hoping that at df6 I could simply set_index back to ['A', 'B', 'Date'], but that did not group the values as they are grouped in the initial df DataFrame.

Any thoughts on how I can reindex the unstacked DataFrame, restack, and have the DataFrame in the same format as the original?

asked Jun 25 '13 01:06

Michael

2 Answers

You can make a new multi index based on the Cartesian product of the levels of the existing multi index. Then, re-index your data frame using the new index.

new_index = pd.MultiIndex.from_product(df.index.levels)
new_df = df.reindex(new_index)

# Optional: convert missing values to zero, and convert the data back
# to integers. See explanation below.
new_df = new_df.fillna(0).astype(int)

That's it! The new data frame has all the possible index values. The existing data is indexed correctly.

Read on for a more detailed explanation.

Explanation

Set up sample data

import pandas as pd

df = pd.DataFrame({'A': ['loc_a'] * 12 + ['loc_b'],
                   'B': ['group_a'] * 7 + ['group_b'] * 3 + ['group_c'] * 2 + ['group_a'],
                   'Date': ["2013-06-11",
                            "2013-07-02",
                            "2013-07-09",
                            "2013-07-30",
                            "2013-08-06",
                            "2013-09-03",
                            "2013-10-01",
                            "2013-07-09",
                            "2013-08-06",
                            "2013-09-03",
                            "2013-07-09",
                            "2013-09-03",
                            "2013-10-01"],
                   'Value': [22, 35, 14,  9,  4, 40, 18, 4, 2, 5, 1, 2, 3]})

df.Date = pd.to_datetime(df.Date)

df = df.set_index(['A', 'B', 'Date'])

Here's what the sample data looks like:

                          Value
A     B       Date
loc_a group_a 2013-06-11     22
              2013-07-02     35
              2013-07-09     14
              2013-07-30      9
              2013-08-06      4
              2013-09-03     40
              2013-10-01     18
      group_b 2013-07-09      4
              2013-08-06      2
              2013-09-03      5
      group_c 2013-07-09      1
              2013-09-03      2
loc_b group_a 2013-10-01      3

Make new index

Using from_product we can make a new multi index. This new index is the Cartesian product of all the values from all the levels of the old index.

new_index = pd.MultiIndex.from_product(df.index.levels)
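One caveat is worth a quick sketch. `df.index.levels` can still list values that no longer occur in the data (for example after filtering rows), and depending on the pandas version `from_product` may not carry the level names over. The example below rebuilds the sample data so it runs on its own; the names `only_a`, `stale_levels`, and `clean_index` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['loc_a'] * 12 + ['loc_b'],
    'B': ['group_a'] * 7 + ['group_b'] * 3 + ['group_c'] * 2 + ['group_a'],
    'Date': pd.to_datetime([
        "2013-06-11", "2013-07-02", "2013-07-09", "2013-07-30",
        "2013-08-06", "2013-09-03", "2013-10-01", "2013-07-09",
        "2013-08-06", "2013-09-03", "2013-07-09", "2013-09-03",
        "2013-10-01"]),
    'Value': [22, 35, 14, 9, 4, 40, 18, 4, 2, 5, 1, 2, 3],
}).set_index(['A', 'B', 'Date'])

# Filtering rows keeps the old level values around, even the unused ones.
only_a = df[df.index.get_level_values('A') == 'loc_a']
stale_levels = list(only_a.index.levels[0])

# remove_unused_levels() trims each level to the values actually present.
clean_index = only_a.index.remove_unused_levels()

# Passing names explicitly keeps the A / B / Date labels on the new index.
new_index = pd.MultiIndex.from_product(
    clean_index.levels, names=list(clean_index.names))
```

Without the `remove_unused_levels()` call, the product here would also include rows for `loc_b`, even though the filtered frame no longer contains it.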

Reindex

Use the new index to reindex the existing data frame.

new_df = df.reindex(new_index)

All the possible combinations are now present. The missing values are null (NaN).

The expanded, re-indexed data frame looks like this:

                          Value
loc_a group_a 2013-06-11   22.0
              2013-07-02   35.0
              2013-07-09   14.0
              2013-07-30    9.0
              2013-08-06    4.0
              2013-09-03   40.0
              2013-10-01   18.0
      group_b 2013-06-11    NaN
              2013-07-02    NaN
              2013-07-09    4.0
              2013-07-30    NaN
              2013-08-06    2.0
              2013-09-03    5.0
              2013-10-01    NaN
      group_c 2013-06-11    NaN
              2013-07-02    NaN
              2013-07-09    1.0
              2013-07-30    NaN
              2013-08-06    NaN
              2013-09-03    2.0
              2013-10-01    NaN
loc_b group_a 2013-06-11    NaN
              2013-07-02    NaN
              2013-07-09    NaN
              2013-07-30    NaN
              2013-08-06    NaN
              2013-09-03    NaN
              2013-10-01    3.0
      group_b 2013-06-11    NaN
              2013-07-02    NaN
              2013-07-09    NaN
              2013-07-30    NaN
              2013-08-06    NaN
              2013-09-03    NaN
              2013-10-01    NaN
      group_c 2013-06-11    NaN
              2013-07-02    NaN
              2013-07-09    NaN
              2013-07-30    NaN
              2013-08-06    NaN
              2013-09-03    NaN
              2013-10-01    NaN

Nulls in integer column

You can see that the data in the new data frame has been converted from ints to floats. A regular NumPy-backed integer column can't hold NaN, so pandas upcasts it to float (the nullable Int64 extension dtype is the exception). Optionally, we can convert all the nulls to 0 and cast the data back to integers.

new_df = new_df.fillna(0).astype(int)
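Two alternatives to the fillna/astype round trip, sketched under the assumption of a reasonably recent pandas: `reindex` accepts a `fill_value`, which avoids creating NaN in the first place, and the nullable `Int64` extension dtype can hold missing values next to integers. The small frame below is made up for illustration and is not the sample data above.

```python
import pandas as pd

# Small illustrative frame (hypothetical, not the sample data above).
idx = pd.MultiIndex.from_tuples(
    [('loc_a', 'group_a'), ('loc_b', 'group_b')], names=['A', 'B'])
small = pd.DataFrame({'Value': [22, 3]}, index=idx)

full_index = pd.MultiIndex.from_product(
    small.index.levels, names=list(small.index.names))

# Option 1: fill the missing rows with 0 while reindexing,
# so no NaN is ever introduced.
zeros = small.reindex(full_index, fill_value=0)

# Option 2: keep the gaps as missing values, stored in a
# nullable integer column instead of a float column.
nullable = small.reindex(full_index).astype('Int64')
```

Option 1 suits the case where a gap really means zero; option 2 keeps "no observation" distinguishable from an observed 0.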

Result

                          Value
loc_a group_a 2013-06-11     22
              2013-07-02     35
              2013-07-09     14
              2013-07-30      9
              2013-08-06      4
              2013-09-03     40
              2013-10-01     18
      group_b 2013-06-11      0
              2013-07-02      0
              2013-07-09      4
              2013-07-30      0
              2013-08-06      2
              2013-09-03      5
              2013-10-01      0
      group_c 2013-06-11      0
              2013-07-02      0
              2013-07-09      1
              2013-07-30      0
              2013-08-06      0
              2013-09-03      2
              2013-10-01      0
loc_b group_a 2013-06-11      0
              2013-07-02      0
              2013-07-09      0
              2013-07-30      0
              2013-08-06      0
              2013-09-03      0
              2013-10-01      3
      group_b 2013-06-11      0
              2013-07-02      0
              2013-07-09      0
              2013-07-30      0
              2013-08-06      0
              2013-09-03      0
              2013-10-01      0
      group_c 2013-06-11      0
              2013-07-02      0
              2013-07-09      0
              2013-07-30      0
              2013-08-06      0
              2013-09-03      0
              2013-10-01      0
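Note that the product of the existing levels only contains dates that already occur somewhere in the data, and it also creates combinations such as loc_b / group_b that were never observed. If the goal is the one stated in the question (every date from 2013-06-11 to 2013-12-31 for each existing group), one possible sketch is to build the index by hand. The helper name `fill_date_range` is made up for this example, and the daily frequency is an assumption.

```python
import pandas as pd

def fill_date_range(df, start, end, freq='D', fill_value=0):
    """Reindex so every existing (A, B) pair covers each date in [start, end]."""
    dates = pd.date_range(start, end, freq=freq)
    # Unique (A, B) pairs that actually occur, in order of first appearance.
    pairs = df.index.droplevel('Date').unique()
    new_index = pd.MultiIndex.from_tuples(
        [(a, b, d) for a, b in pairs for d in dates],
        names=list(df.index.names),
    )
    return df.reindex(new_index, fill_value=fill_value)

df = pd.DataFrame({
    'A': ['loc_a'] * 12 + ['loc_b'],
    'B': ['group_a'] * 7 + ['group_b'] * 3 + ['group_c'] * 2 + ['group_a'],
    'Date': pd.to_datetime([
        "2013-06-11", "2013-07-02", "2013-07-09", "2013-07-30",
        "2013-08-06", "2013-09-03", "2013-10-01", "2013-07-09",
        "2013-08-06", "2013-09-03", "2013-07-09", "2013-09-03",
        "2013-10-01"]),
    'Value': [22, 35, 14, 9, 4, 40, 18, 4, 2, 5, 1, 2, 3],
}).set_index(['A', 'B', 'Date'])

filled = fill_date_range(df, '2013-06-11', '2013-12-31')
```

Because the index is built from the observed (A, B) pairs rather than from the product of the levels, `filled` has one block of dates per existing group and nothing for groups that never appeared.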

answered Sep 24 '22 00:09

Christian Long

Your question wasn't clear about exactly which dates you were missing; I'm just assuming that you want to fill NaN for any date for which you do have an observation elsewhere. My solution will have to be amended if this assumption is faulty.

Side note: it may be nice to include a line to create the DataFrame.

In [55]: df = pd.DataFrame({'A': ['loc_a'] * 12 + ['loc_b'],
   ....:                    'B': ['group_a'] * 7 + ['group_b'] * 3 + ['group_c'] * 2 + ['group_a'],
   ....:                    'Date': ["2013-06-11",
   ....:                            "2013-07-02",
   ....:                            "2013-07-09",
   ....:                            "2013-07-30",
   ....:                            "2013-08-06",
   ....:                            "2013-09-03",
   ....:                            "2013-10-01",
   ....:                            "2013-07-09",
   ....:                            "2013-08-06",
   ....:                            "2013-09-03",
   ....:                            "2013-07-09",
   ....:                            "2013-09-03",
   ....:                            "2013-10-01"],
   ....:                    'Value': [22, 35, 14,  9,  4, 40, 18, 4, 2, 5, 1, 2, 3]})

In [56]: df.Date = pd.to_datetime(df.Date)

In [57]: df = df.set_index(['A', 'B', 'Date'])

In [58]: print(df)
                          Value
A     B       Date
loc_a group_a 2013-06-11     22
              2013-07-02     35
              2013-07-09     14
              2013-07-30      9
              2013-08-06      4
              2013-09-03     40
              2013-10-01     18
      group_b 2013-07-09      4
              2013-08-06      2
              2013-09-03      5
      group_c 2013-07-09      1
              2013-09-03      2
loc_b group_a 2013-10-01      3

To get the unobserved values filled, we'll use the unstack and stack methods. Unstacking will create the NaNs we're interested in, and then we'll stack them up to work with.

In [71]: df.unstack(['A', 'B'])
Out[71]:
              Value
A             loc_a                      loc_b
B           group_a  group_b  group_c  group_a
Date
2013-06-11       22      NaN      NaN      NaN
2013-07-02       35      NaN      NaN      NaN
2013-07-09       14        4        1      NaN
2013-07-30        9      NaN      NaN      NaN
2013-08-06        4        2      NaN      NaN
2013-09-03       40        5        2      NaN
2013-10-01       18      NaN      NaN        3

In [59]: df.unstack(['A', 'B']).fillna(0).stack(['A', 'B'])
Out[59]:
                          Value
Date       A     B
2013-06-11 loc_a group_a     22
                 group_b      0
                 group_c      0
           loc_b group_a      0
2013-07-02 loc_a group_a     35
                 group_b      0
                 group_c      0
           loc_b group_a      0
2013-07-09 loc_a group_a     14
                 group_b      4
                 group_c      1
           loc_b group_a      0
2013-07-30 loc_a group_a      9
                 group_b      0
                 group_c      0
           loc_b group_a      0
2013-08-06 loc_a group_a      4
                 group_b      2
                 group_c      0
           loc_b group_a      0
2013-09-03 loc_a group_a     40
                 group_b      5
                 group_c      2
           loc_b group_a      0
2013-10-01 loc_a group_a     18
                 group_b      0
                 group_c      0
           loc_b group_a      3

Reorder the index levels as necessary.

I had to slip that fillna(0) in the middle there so that the NaNs weren't dropped. stack does have a dropna argument, and I would have thought that setting it to False would keep the all-NaN rows around. A bug, maybe?
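The reordering step mentioned above might look like the following sketch, which puts the levels back in the original A, B, Date order and restores the integer dtype. It uses a trimmed-down, made-up version of the sample data so it runs on its own. How `stack` treats missing values has changed across pandas versions, so filling before stacking is the cautious choice here.

```python
import pandas as pd

# Trimmed-down illustrative data (hypothetical subset of the sample).
df = pd.DataFrame({
    'A': ['loc_a', 'loc_a', 'loc_a', 'loc_b'],
    'B': ['group_a', 'group_a', 'group_b', 'group_a'],
    'Date': pd.to_datetime(
        ['2013-06-11', '2013-07-02', '2013-07-02', '2013-06-11']),
    'Value': [22, 35, 4, 3],
}).set_index(['A', 'B', 'Date'])

filled = (
    df.unstack(['A', 'B'])                 # dates as rows, (A, B) pairs as columns
      .fillna(0)                           # fill before stacking so no rows get dropped
      .stack(['A', 'B'])                   # back to long form, indexed by Date, A, B
      .reorder_levels(['A', 'B', 'Date'])  # restore the original level order
      .sort_index()                        # group rows by A, then B, then Date
      .astype(int)                         # unstack upcast the values to float
)
```

The result has the same shape as the original `df` (one `Value` column under an A, B, Date index), with a zero row for every date that an existing group was missing.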

answered Sep 26 '22 00:09

TomAugspurger

Tags:

python

pandas

dataframe

numpy

multi-index