Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas-Add missing years in time series data with duplicate years

I have a dataset like this where data for some years are missing .

County Year Pop
12     1999 1.1
12     2001 1.2
13     1999 1.0
13     2000 1.1

I want something like

County Year Pop
12     1999 1.1
12     2000 NaN
12     2001 1.2
13     1999 1.0
13     2000 1.1
13     2001 nan

I have tried setting index to year and then using reindex with another dataframe of just years method (mentioned here Pandas: Add data for missing months) but it gives me error cant reindex with duplicate values. I have also tried df.loc but it has same issue. I even tried a full outer join with blank df of just years but that also didnt work.

How can I solve this?

like image 713
ks2882 Avatar asked May 06 '17 04:05

ks2882


1 Answers

Make a MultiIndex so you don't have duplicates:

df.set_index(['County', 'Year'], inplace=True)

Then construct a full MultiIndex with all the combinations:

index = pd.MultiIndex.from_product(df.index.levels)

Then reindex:

df.reindex(index)

The construction of the MultiIndex is untested and may need a little tweaking (e.g. if a year is entirely absent from all counties), but I think you get the idea.

like image 129
John Zwinck Avatar answered Sep 20 '22 14:09

John Zwinck