Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python Pandas: Balance an unbalanced dataset (for panel analysis)

I know this might be easy to do. I can do it in Stata but I'm trying to move to Python.

I have a big dataset that it's unbalance. It looks like this:

enter image description here

And I need to get a dataset as follows:

enter image description here

Any guidance it's welcome. Thanks a lot!

like image 207
PAstudilloE Avatar asked Sep 13 '25 11:09

PAstudilloE


1 Answers

one way is to set 'year' as another level of index with set_index, reindex using pd.MultiIndex.from_product and reset_index the data from 'year' as a column.

Example dataframe with the same structure:

import pandas as pd

df = pd.DataFrame( {'year':[2003,2004,2002,2004,2005,2006],
                    'city_code':['a']*2+['b']*4,
                    'total_tax':pd.np.random.randint(100,1000,6)},
                   index=pd.Index(data=[9]*2+[54]*4,name='id_inf'))
print(df)
       city_code  total_tax  year
id_inf                           
9              a        417  2003
9              a        950  2004
54             b        801  2002
54             b        218  2004
54             b        886  2005
54             b        855  2006

Now you can create the df_balanced with the method:

df_balanced = (df.set_index('year',append=True)
                 .reindex(pd.MultiIndex.from_product([df.index.unique(),
                                                      range(df.year.min(),df.year.max()+1)],
                                                     names=['id_inf','year']))
                 .reset_index(level=1))

And you get:

print (df_balanced)
        year city_code  total_tax
id_inf                           
9       2002       NaN        NaN
9       2003         a      417.0
9       2004         a      950.0
9       2005       NaN        NaN
9       2006       NaN        NaN
54      2002         b      801.0
54      2003       NaN        NaN
54      2004         b      218.0
54      2005         b      886.0
54      2006         b      855.0

To fill the NaN, different methods but here two ways. For the column 'city_code', you can use groupby and transform with max to get the value and for the column 'total_tax', just fillna with 0 such as:

df_balanced['city_code'] = df_balanced.groupby(level=0)['city_code'].transform(max)
df_balanced['total_tax'] = df_balanced['total_tax'].fillna(0)

print (df_balanced)
        year city_code  total_tax
id_inf                           
9       2002         a        0.0
9       2003         a      417.0
9       2004         a      950.0
9       2005         a        0.0
9       2006         a        0.0
54      2002         b      801.0
54      2003         b        0.0
54      2004         b      218.0
54      2005         b      886.0
54      2006         b      855.0
like image 163
Ben.T Avatar answered Sep 16 '25 01:09

Ben.T