Insert 0-values for missing dates within MultiIndex

Tags: python, pandas

Let's assume I have a MultiIndex consisting of the date and some categories (just one, for simplicity, in the example below), and for each category I have a time series with values of some process. I only have a value when there was an observation, and I now want to add a 0 for every date on which there was no observation. I found a way that seems very inefficient: unstacking and stacking, which creates a huge number of columns when there are millions of categories.

import datetime as dt
import pandas as pd

days = 4
# List of all dates that should be in the index
all_dates = [dt.date(2013, 2, 13) - dt.timedelta(days=x)
    for x in range(days)]
df = pd.DataFrame([
    (dt.date(2013, 2, 10), 1, 4),
    (dt.date(2013, 2, 10), 2, 7),
    (dt.date(2013, 2, 11), 2, 7),
    (dt.date(2013, 2, 13), 1, 2),
    (dt.date(2013, 2, 13), 2, 3)],
    columns=['date', 'category', 'value'])
df.set_index(['date', 'category'], inplace=True)
print(df)
# insert 0 values for missing dates
print(df.unstack().reindex(all_dates).fillna(0).stack())
print(all_dates)

                        value
date       category       
2013-02-10 1             4
           2             7
2013-02-11 2             7
2013-02-13 1             2
           2             3

                      value
            category       
2013-02-13 1             2
           2             3
2013-02-12 1             0
           2             0
2013-02-11 1             0
           2             7
2013-02-10 1             4
           2             7
[datetime.date(2013, 2, 13), datetime.date(2013, 2, 12),
    datetime.date(2013, 2, 11), datetime.date(2013, 2, 10)]

Does anybody know a smarter way to achieve the same?

EDIT: I found another possibility to achieve the same:

import datetime as dt
import pandas as pd

days = 4
# List of all dates that should be in the index
all_dates = [dt.date(2013, 2, 13) - dt.timedelta(days=x) for x in range(days)]
df = pd.DataFrame([
    (dt.date(2013, 2, 10), 1, 4, 5),
    (dt.date(2013, 2, 10), 2, 1, 7),
    (dt.date(2013, 2, 10), 2, 2, 7),
    (dt.date(2013, 2, 11), 2, 3, 7),
    (dt.date(2013, 2, 13), 1, 4, 2),
    (dt.date(2013, 2, 13), 2, 4, 3)],
    columns=['date', 'category', 'cat2', 'value'])
date_col = 'date'
other_index = ['category', 'cat2']
index = [date_col] + other_index
df.set_index(index, inplace=True)
# reindex each (category, cat2) group against the full list of dates
grouped = df.groupby(level=other_index)
df_list = []
for i, group in grouped:
    df_list.append(group.reset_index(level=other_index).reindex(all_dates).fillna(0))
print(pd.concat(df_list).set_index(other_index, append=True))

                    value
           category cat2       
2013-02-13 1        4         2
2013-02-12 0        0         0
2013-02-11 0        0         0
2013-02-10 1        4         5
2013-02-13 0        0         0
2013-02-12 0        0         0
2013-02-11 0        0         0
2013-02-10 2        1         7
2013-02-13 0        0         0
2013-02-12 0        0         0
2013-02-11 0        0         0
2013-02-10 2        2         7
2013-02-13 0        0         0
2013-02-12 0        0         0
2013-02-11 2        3         7
2013-02-10 0        0         0
2013-02-13 2        4         3
2013-02-12 0        0         0
2013-02-11 0        0         0
2013-02-10 0        0         0
Arthur G asked Feb 13 '13 15:02


2 Answers

Check out this answer: How to fill the missing record of Pandas dataframe in pythonic way?

You can do something like:

import datetime
import pandas as pd

# make an empty dataframe with the index you want
def get_datetime(x):
    return datetime.date(2013, 2, 13) - datetime.timedelta(days=x)

all_dates = [get_datetime(x) for x in range(4)]
categories = [1, 2, 3, 4]
index = [[date, cat] for cat in categories for date in all_dates]

# this df will be just an index
df = pd.DataFrame(index)
df.columns = ['date', 'category']
df = df.set_index(['date', 'category'])

# now if your original df is called df_orig you can reindex against the other values
# (reindex replaces the older reindex_axis, which newer pandas versions no longer have)
df_orig = df_orig.reindex(df.index)

# and to add zeros (fillna returns a new frame, so assign the result)
df_orig = df_orig.fillna(0)
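When there are several non-date index levels, as in the question's EDIT, a variation on this idea is to pair every date only with the combinations that actually occur, so the target index does not grow to the full product of all levels. The following is a rough sketch under that assumption; the sample rows and the names `pairs`, `full_index` and `filled` are made up for illustration.

```python
import datetime as dt
import pandas as pd

all_dates = [dt.date(2013, 2, 13) - dt.timedelta(days=x) for x in range(4)]

# Small sample frame with two non-date index levels (illustrative data).
df_orig = pd.DataFrame(
    [(dt.date(2013, 2, 10), 1, 4, 5),
     (dt.date(2013, 2, 10), 2, 1, 7),
     (dt.date(2013, 2, 11), 2, 3, 7),
     (dt.date(2013, 2, 13), 1, 4, 2)],
    columns=['date', 'category', 'cat2', 'value'],
).set_index(['date', 'category', 'cat2'])

# Distinct (category, cat2) pairs that appear anywhere in the data.
pairs = df_orig.index.droplevel('date').unique()

# Pair every date with every observed (category, cat2) combination.
full_index = pd.MultiIndex.from_tuples(
    [(d,) + tuple(p) for d in all_dates for p in pairs],
    names=df_orig.index.names,
)

# Rows missing from df_orig get the value 0.
filled = df_orig.reindex(full_index, fill_value=0)
print(filled)
```

Building the tuples in a Python loop is simple but may be slow for millions of combinations, so treat this as a starting point rather than a tuned solution.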
zach answered Oct 17 '22 08:10


You can make a new multi index based on the Cartesian product of the index levels you want. Then, re-index your data frame using the new index.

(date_index, category_index) = df.index.levels
new_index = pd.MultiIndex.from_product([all_dates, category_index])
new_df = df.reindex(new_index)

# Optional: convert missing values to zero, and convert the data back
# to integers. See explanation below.
new_df = new_df.fillna(0).astype(int)

That's it! The new data frame has all the possible index values. The existing data is indexed correctly.

Read on for a more detailed explanation.


Explanation

Set up sample data

import datetime as dt
import pandas as pd

days= 4
#List of all dates that should be in the index
all_dates = [dt.date(2013, 2, 13) - dt.timedelta(days=x)
    for x in range(days)]
df = pd.DataFrame([
    (dt.date(2013, 2, 10), 1, 4),
    (dt.date(2013, 2, 10), 2, 7),
    (dt.date(2013, 2, 11), 2, 7),
    (dt.date(2013, 2, 13), 1, 2),
    (dt.date(2013, 2, 13), 2, 3)],
    columns = ['date', 'category', 'value'])
df.set_index(['date', 'category'], inplace=True)

Here's what the sample data looks like

                     value
date       category
2013-02-10 1             4
           2             7
2013-02-11 2             7
2013-02-13 1             2
           2             3

Make new index

Using from_product we can make a new multi index. This new index is the Cartesian product of all the values you pass to the function.

(date_index, category_index) = df.index.levels

new_index = pd.MultiIndex.from_product([all_dates, category_index])
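To see what the Cartesian product looks like on its own, here is a tiny sketch with two dates and two categories (values chosen only for illustration). It also passes the optional `names` argument of `from_product`, which is not used in the snippet above; without it the new index has unnamed levels.

```python
import datetime as dt
import pandas as pd

dates = [dt.date(2013, 2, 13), dt.date(2013, 2, 12)]
categories = [1, 2]

# Every date is paired with every category, in the order given.
index = pd.MultiIndex.from_product(
    [dates, categories], names=['date', 'category'])

for row in index:
    print(row)
```

The result has len(dates) * len(categories) entries, which is worth keeping in mind when both lists are large.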

Reindex

Use the new index to reindex the existing data frame.

All the possible combinations are now present. The missing values are null (NaN).

new_df = df.reindex(new_index)

Now, the expanded, re-indexed data frame looks like this:

              value
2013-02-13 1    2.0
           2    3.0
2013-02-12 1    NaN
           2    NaN
2013-02-11 1    NaN
           2    7.0
2013-02-10 1    4.0
           2    7.0
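As an aside, `reindex` also accepts a `fill_value` argument, so the missing combinations can be filled with 0 in the same call instead of becoming NaN. A minimal sketch using the same sample data (the name `filled` is just for this example):

```python
import datetime as dt
import pandas as pd

all_dates = [dt.date(2013, 2, 13) - dt.timedelta(days=x) for x in range(4)]
df = pd.DataFrame(
    [(dt.date(2013, 2, 10), 1, 4),
     (dt.date(2013, 2, 10), 2, 7),
     (dt.date(2013, 2, 11), 2, 7),
     (dt.date(2013, 2, 13), 1, 2),
     (dt.date(2013, 2, 13), 2, 3)],
    columns=['date', 'category', 'value'],
).set_index(['date', 'category'])

new_index = pd.MultiIndex.from_product(
    [all_dates, df.index.levels[1]], names=df.index.names)

# Missing (date, category) combinations are filled with 0 directly.
filled = df.reindex(new_index, fill_value=0)
print(filled)
```

Because no NaN is ever introduced, the column should keep its integer dtype and no cast back is needed.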

Nulls in integer column

You can see that the data in the new data frame has been converted from ints to floats. With the default NumPy-backed integer dtype, pandas can't store nulls in an integer column, so the column is upcast to float to hold NaN. Optionally, we can convert all the nulls to 0 and cast the data back to integers.

new_df = new_df.fillna(0).astype(int)

Result

              value
2013-02-13 1      2
           2      3
2013-02-12 1      0
           2      0
2013-02-11 1      0
           2      7
2013-02-10 1      4
           2      7
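If you would rather keep the gaps as missing values than turn them into zeros, one option is the nullable integer dtype `Int64` (capital I), assuming a pandas version that provides it (0.24 or later). A small sketch on a stand-alone Series with made-up values:

```python
import numpy as np
import pandas as pd

# A float column with a gap, like the reindexed 'value' column.
s = pd.Series([2.0, 3.0, np.nan, 7.0], name='value')

# Whole-number floats convert cleanly; NaN becomes <NA>.
as_nullable = s.astype('Int64')
print(as_nullable)
```

The same cast can be applied to a data frame column, for example new_df['value'].astype('Int64') on the frame produced by reindex before any fillna.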
Christian Long answered Oct 17 '22 08:10