I have a MultiIndex consisting of a date and some categories (just one in the example below, for simplicity). For each category I have a time series with the values of some process. A value exists only where there was an observation, and I now want to add a 0 for every date on which there was none. The approach I found seems very inefficient: it stacks and unstacks, which creates a huge number of columns when there are millions of categories.
import datetime as dt
import pandas as pd
days = 4
# List of all dates that should be in the index
all_dates = [dt.date(2013, 2, 13) - dt.timedelta(days=x)
             for x in range(days)]
df = pd.DataFrame([
    (dt.date(2013, 2, 10), 1, 4),
    (dt.date(2013, 2, 10), 2, 7),
    (dt.date(2013, 2, 11), 2, 7),
    (dt.date(2013, 2, 13), 1, 2),
    (dt.date(2013, 2, 13), 2, 3)],
    columns=['date', 'category', 'value'])
df.set_index(['date', 'category'], inplace=True)
print(df)
# Insert 0 values for missing dates
print(df.unstack().reindex(all_dates).fillna(0).stack())
print(all_dates)
                     value
date       category
2013-02-10 1             4
           2             7
2013-02-11 2             7
2013-02-13 1             2
           2             3
                     value
           category
2013-02-13 1             2
           2             3
2013-02-12 1             0
           2             0
2013-02-11 1             0
           2             7
2013-02-10 1             4
           2             7
[datetime.date(2013, 2, 13), datetime.date(2013, 2, 12),
datetime.date(2013, 2, 11), datetime.date(2013, 2, 10)]
Does anybody know a smarter way to achieve the same?
EDIT: I found another possibility to achieve the same:
import datetime as dt
import pandas as pd
days = 4
# List of all dates that should be in the index
all_dates = [dt.date(2013, 2, 13) - dt.timedelta(days=x) for x in range(days)]
df = pd.DataFrame([(dt.date(2013, 2, 10), 1, 4, 5),
                   (dt.date(2013, 2, 10), 2, 1, 7),
                   (dt.date(2013, 2, 10), 2, 2, 7),
                   (dt.date(2013, 2, 11), 2, 3, 7),
                   (dt.date(2013, 2, 13), 1, 4, 2),
                   (dt.date(2013, 2, 13), 2, 4, 3)],
                  columns=['date', 'category', 'cat2', 'value'])
date_col = 'date'
other_index = ['category', 'cat2']
index = [date_col] + other_index
df.set_index(index, inplace=True)
grouped = df.groupby(level=other_index)
df_list = []
# Reindex each (category, cat2) group against the full list of dates
for i, group in grouped:
    df_list.append(group.reset_index(level=other_index).reindex(all_dates).fillna(0))
print(pd.concat(df_list).set_index(other_index, append=True))
                          value
           category cat2
2013-02-13 1        4         2
2013-02-12 0        0         0
2013-02-11 0        0         0
2013-02-10 1        4         5
2013-02-13 0        0         0
2013-02-12 0        0         0
2013-02-11 0        0         0
2013-02-10 2        1         7
2013-02-13 0        0         0
2013-02-12 0        0         0
2013-02-11 0        0         0
2013-02-10 2        2         7
2013-02-13 0        0         0
2013-02-12 0        0         0
2013-02-11 2        3         7
2013-02-10 0        0         0
2013-02-13 2        4         3
2013-02-12 0        0         0
2013-02-11 0        0         0
2013-02-10 0        0         0
Check out this answer: How to fill the missing record of Pandas dataframe in pythonic way?
You can do something like:
import datetime
import pandas as pd
# Make an empty dataframe with the index you want
def get_datetime(x):
    return datetime.date(2013, 2, 13) - datetime.timedelta(days=x)
all_dates = [get_datetime(x) for x in range(4)]
categories = [1, 2, 3, 4]
index = [[date, cat] for cat in categories for date in all_dates]
# This df will be just an index
df = pd.DataFrame(index, columns=['date', 'category'])
df = df.set_index(['date', 'category'])
# Now, if your original df is called df_orig, you can reindex it against the full index
# (reindex replaces reindex_axis, which newer pandas versions no longer provide)
df_orig = df_orig.reindex(df.index)
# And to add zeros (fillna returns a new frame, so assign the result)
df_orig = df_orig.fillna(0)
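The snippet above assumes that df_orig already exists. As a rough end-to-end sketch, here is the same idea applied to the sample data from the question. The names full and filled, and restricting categories to the two values that occur in the sample, are my own assumptions for illustration.

```python
import datetime as dt
import pandas as pd

# Sample data from the question; df_orig is the name the answer above assumes
df_orig = pd.DataFrame(
    [(dt.date(2013, 2, 10), 1, 4),
     (dt.date(2013, 2, 10), 2, 7),
     (dt.date(2013, 2, 11), 2, 7),
     (dt.date(2013, 2, 13), 1, 2),
     (dt.date(2013, 2, 13), 2, 3)],
    columns=['date', 'category', 'value'],
).set_index(['date', 'category'])

all_dates = [dt.date(2013, 2, 13) - dt.timedelta(days=x) for x in range(4)]
categories = [1, 2]  # assumption: only the categories present in the sample

# A frame that carries nothing but the full (date, category) index
full = pd.DataFrame(
    [(date, cat) for cat in categories for date in all_dates],
    columns=['date', 'category'],
).set_index(['date', 'category'])

# Align the data to the full index, then turn the gaps into zeros
filled = df_orig.reindex(full.index).fillna(0)
print(filled)
```

Note that the value column comes back as floats here, because the intermediate NaN values force a float dtype.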
You can make a new multi index based on the Cartesian product of the index levels you want. Then, re-index your data frame using the new index.
(date_index, category_index) = df.index.levels
new_index = pd.MultiIndex.from_product([all_dates, category_index])
new_df = df.reindex(new_index)
# Optional: convert missing values to zero, and convert the data back
# to integers. See explanation below.
new_df = new_df.fillna(0).astype(int)
That's it! The new data frame has all the possible index values. The existing data is indexed correctly.
Read on for a more detailed explanation.
import datetime as dt
import pandas as pd
days = 4
# List of all dates that should be in the index
all_dates = [dt.date(2013, 2, 13) - dt.timedelta(days=x)
             for x in range(days)]
df = pd.DataFrame([
    (dt.date(2013, 2, 10), 1, 4),
    (dt.date(2013, 2, 10), 2, 7),
    (dt.date(2013, 2, 11), 2, 7),
    (dt.date(2013, 2, 13), 1, 2),
    (dt.date(2013, 2, 13), 2, 3)],
    columns=['date', 'category', 'value'])
df.set_index(['date', 'category'], inplace=True)
Here's what the sample data looks like:
                     value
date       category
2013-02-10 1             4
           2             7
2013-02-11 2             7
2013-02-13 1             2
           2             3
Using from_product we can make a new multi index. This new index is the Cartesian product of all the values you pass to the function.
(date_index, category_index) = df.index.levels
new_index = pd.MultiIndex.from_product([all_dates, category_index])
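To make the Cartesian product concrete, here is a tiny standalone sketch. The labels in dates and categories are made up for illustration and are not the question's data. Passing names= is optional; it just keeps the level names, which the unnamed index above drops.

```python
import pandas as pd

# Small illustrative inputs (made-up labels, not the question's data)
dates = ['2013-02-12', '2013-02-13']
categories = [1, 2, 3]

idx = pd.MultiIndex.from_product([dates, categories],
                                 names=['date', 'category'])

# One entry per (date, category) pair: 2 dates x 3 categories = 6 entries
print(len(idx))
print(idx.tolist())
```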
Use the new index to reindex the existing data frame.
All the possible combinations are now present. The missing values are null (NaN).
new_df = df.reindex(new_index)
Now, the expanded, re-indexed data frame looks like this:
              value
2013-02-13 1    2.0
           2    3.0
2013-02-12 1    NaN
           2    NaN
2013-02-11 1    NaN
           2    7.0
2013-02-10 1    4.0
           2    7.0
You can see that the data in the new data frame has been converted from ints to floats. Pandas can't have nulls in an integer column. Optionally, we can convert all the nulls to 0, and cast the data back to integers.
new_df = new_df.fillna(0).astype(int)
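As a possible alternative, reindex accepts a fill_value argument, so the newly created rows can be filled with 0 directly and the detour through NaN and floats is avoided. This is a minimal sketch on the same sample data; treat it as one option rather than the only way to do it.

```python
import datetime as dt
import pandas as pd

df = pd.DataFrame(
    [(dt.date(2013, 2, 10), 1, 4),
     (dt.date(2013, 2, 10), 2, 7),
     (dt.date(2013, 2, 11), 2, 7),
     (dt.date(2013, 2, 13), 1, 2),
     (dt.date(2013, 2, 13), 2, 3)],
    columns=['date', 'category', 'value'],
).set_index(['date', 'category'])

all_dates = [dt.date(2013, 2, 13) - dt.timedelta(days=x) for x in range(4)]
new_index = pd.MultiIndex.from_product(
    [all_dates, df.index.levels[1]], names=df.index.names)

# fill_value is used only for rows that did not exist before the reindex,
# so no NaN is introduced and the integer dtype is kept
new_df = df.reindex(new_index, fill_value=0)
print(new_df.dtypes)
```

One difference to keep in mind: fill_value only applies to rows created by the reindex. NaN values that were already in the data stay as they are, whereas fillna(0) replaces those too.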
Result:
              value
2013-02-13 1      2
           2      3
2013-02-12 1      0
           2      0
2013-02-11 1      0
           2      7
2013-02-10 1      4
           2      7
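For the two-category case from the question's EDIT (category and cat2), a full product over all three levels would also create (category, cat2) pairs that never occur in the data: 32 rows instead of 20 for that sample. A rough sketch of one way around this, under the assumption that only observed pairs should be kept, is to cross the dates with the distinct pairs. The name pairs is my own.

```python
import datetime as dt
import pandas as pd

df = pd.DataFrame(
    [(dt.date(2013, 2, 10), 1, 4, 5),
     (dt.date(2013, 2, 10), 2, 1, 7),
     (dt.date(2013, 2, 10), 2, 2, 7),
     (dt.date(2013, 2, 11), 2, 3, 7),
     (dt.date(2013, 2, 13), 1, 4, 2),
     (dt.date(2013, 2, 13), 2, 4, 3)],
    columns=['date', 'category', 'cat2', 'value'],
).set_index(['date', 'category', 'cat2'])

all_dates = [dt.date(2013, 2, 13) - dt.timedelta(days=x) for x in range(4)]

# Distinct (category, cat2) pairs that actually occur in the data
pairs = df.index.droplevel('date').unique()

# Cross every date with every observed pair
new_index = pd.MultiIndex.from_tuples(
    [(date, cat, cat2) for date in all_dates for cat, cat2 in pairs],
    names=df.index.names,
)

new_df = df.reindex(new_index, fill_value=0)
print(new_df)
```

Unlike the groupby loop in the question, this keeps the category labels intact on the filled rows, because only the value column is filled with 0.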