Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Inserting "missing" multiindex rows into a Pandas Dataframe

I have a pandas DataFrame with a two-level multiindex. The second level is numeric and supposed to be sorted and sequential for each unique value of the first-level index, but has gaps. How do I insert the "missing" rows? Sample input:

import pandas as pd
df = pd.DataFrame(list(range(5)),
                  index=pd.MultiIndex.from_tuples([('A',1), ('A',3),
                                                   ('B',2), ('B',3), ('B',6)]),
                  columns='value')
#     value
#A 1      0
#  3      1
#B 2      2
#  3      3
#  6      4

Expected output:

#     value
#A 1      0
#  2    NaN
#  3      1
#B 2      2
#  3      3
#  4    NaN
#  5    NaN
#  6      4

I suspect I could have used resample, but I am having trouble converting the numbers to anything date-like.

like image 746
DYZ Avatar asked Jan 30 '19 21:01

DYZ


2 Answers

If there is a will, there is a way. I am not proud of this but, I think it works.

Try:

def f(x):
    levels = x.index.remove_unused_levels().levels
    x = x.reindex(pd.MultiIndex.from_product([levels[0], np.arange(levels[1][0], levels[1][-1]+1)]))
    return x

df.groupby(level=0, as_index=False, group_keys=False).apply(f)

Output:

     value
A 1    0.0
  2    NaN
  3    1.0
B 2    2.0
  3    3.0
  4    NaN
  5    NaN
  6    4.0
like image 83
Scott Boston Avatar answered Sep 30 '22 14:09

Scott Boston


After much deliberations, I was able to come up with a solution myself. Judging by the fact of how lousy it is, the problem I am facing is not a very typical one.

new_index = d.index.to_frame()\
                .groupby(0)[1]\
                .apply(lambda x:
                         pd.Series(1, index=range(x.min(), x.max() + 1))).index
d.reindex(new_index)
like image 37
DYZ Avatar answered Sep 30 '22 16:09

DYZ