Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Reindexing only level of a MultiIndex dataframe, reindex() broken?

Tags:

python

pandas

I try to reindex a dataframe with two or more indices on one level only, but using .reindex() with the level argument, as I have seen in this answer seems to do nothing instead of the expected behavior.

Here is the code I try to make work:

import pandas as pd

dtest = pd.DataFrame([['Martin', 'room_1', 3],
                      ['Martin', 'room_2', 2],
                      ['Georges', 'room_2', 4],
                      ['Georges', 'room_1', 12]],
                     columns=['name', 'room', 'time_spent'])
dtest.set_index(['name', 'room',], inplace=True)
display(dtest)
print(dtest.reindex(
    ['room_1', 'room_2', 'room_3'], level=1, fill_value=0))

which outputs

                time_spent
name    room              
Martin  room_1           3
        room_2           2
Georges room_1          12
        room_2           4

whereas the output I am expecting is

                time_spent
name    room              
Martin  room_1           3
        room_2           2
        room_3           0
Georges room_1          12
        room_2           4
        room_3           0

reindex() literally didn't do anything.

Did I miss an important detail on the use of reindex or is something broken?

I'm using the last version of pandas with Python 3.6.7 (v3.6.7:6ec5cf24b7, Oct 20 2018, 03:02:14).

Edit after accepting the answer

I have accepted @anky_91's answer who actually provides two different solutions. As my actual use case involves more levels in the multi-index, let me underline the behaviour of his two solutions with more levels.

The starting dataframe is now given by

import pandas as pd

dtest2 = pd.DataFrame([['2020-01-05', 'Martin', 'room_1', 3],
                      ['2020-01-05', 'Martin', 'room_2', 2],
                      ['2020-01-06', 'Georges', 'room_2', 4],
                      ['2020-01-06', 'Georges', 'room_1', 12]],
                     columns=['date', 'name', 'room', 'time_spent'])
dtest2.set_index(['date', 'name', 'room',], inplace=True)
print(dtest2)

which outputs

                           time_spent
date       name    room              
2020-01-05 Martin  room_1           3
                   room_2           2
2020-01-06 Georges room_2           4
                   room_1          12

The first solution adapted to this case is

mux = pd.MultiIndex.from_product((dtest2.index.get_level_values(0).unique(),
                                  dtest2.index.get_level_values(1).unique(),
                                  ['room_1', 'room_2', 'room_3']
                            ))
final_first_solution = dtest2.reindex(mux,fill_value=0)
print(final_first_solution)

which outputs

                           time_spent
2020-01-05 Martin  room_1           3
                   room_2           2
                   room_3           0
           Georges room_1           0
                   room_2           0
                   room_3           0
2020-01-06 Martin  room_1           0
                   room_2           0
                   room_3           0
           Georges room_1          12
                   room_2           4
                   room_3           0

This is not my desired output as Martin needs to only appear on 2020-01-05 and Georges only appears on 2020-01-06. In other words, I only want to expand the index on the last level room. However, I managed to adapt the second solution to get my desired output:

final_second_solution = dtest2.unstack((0, 1)).reindex(['room_1', 'room_2', 'room_3'], fill_value=0)\
        .stack((-2, -1)).swaplevel(i=-1, j=0).swaplevel(i=1, j=0).sort_index()
print(final_second_solution)

correctly gives me

                           time_spent
date       name    room              
2020-01-05 Martin  room_1         3.0
                   room_2         2.0
                   room_3         0.0
2020-01-06 Georges room_1        12.0
                   room_2         4.0
                   room_3         0.0

hence why I accepted the answer. Thanks!

Edit 2 for a different use case

After testing more, the above solution fails when you are not adding a new room, but instead want to complete all rooms appearing in the dataframe. But the solution is simpler, as in this case unstack already creates the missing lines so we just have to use fillna instead of reindex. The following code

import pandas as pd

dtest3 = pd.DataFrame([['2020-01-05', 'Martin', 'room_1', 3],
                      ['2020-01-06', 'Georges', 'room_2', 4]],
                     columns=['date', 'name', 'room', 'time_spent'])
dtest3.set_index(['date', 'name', 'room',], inplace=True)
print(dtest3)
final_third_solution = dtest3.unstack((0, 1)).fillna(0).stack((-2, -1)).reorder_levels([1,2,0]).sort_index()
print(final_third_solution)

hence gives the desired outcome

                           time_spent
date       name    room              
2020-01-05 Martin  room_1         3.0
                   room_2         0.0
2020-01-06 Georges room_1         0.0
                   room_2         4.0

(I have also included the suggestion in the comments of using reorder_levels instead of using multiple swaplevel).

Using .fillna(0).reindex(..., fill_value=0) should encompass the two cases where you want to complete and add new elements.

like image 822
Guillermo Durand Avatar asked Dec 24 '19 16:12

Guillermo Durand


1 Answers

You would need to create a multi index for this using pd.MultiIndex.from_product() and get_level_values:

mux=(pd.MultiIndex.from_product((dtest.index.get_level_values(0).unique()
                             ,['room_1', 'room_2', 'room_3'])))
final=dtest.reindex(mux,fill_value=0)

                time_spent
Martin  room_1           3
        room_2           2
        room_3           0
Georges room_1          12
        room_2           4
        room_3           0

If you want to just add another index with the existing indexes , use Index.union to the from_product() arg:

mux=(pd.MultiIndex.from_product((dtest.index.get_level_values(0).unique()
        ,dtest.index.get_level_values(1).unique().union(['room3'],sort=False))))
final=dtest.reindex(mux,fill_value=0)

                time_spent
Martin  room_1           3
        room_2           2
        room_3           0
Georges room_1          12
        room_2           4
        room_3           0

Adding another metod with stack() and unstack() with swaplevel:

final=dtest.unstack(0).reindex(['room_1', 'room_2', 'room_3']
                     ,fill_value=0).stack().swaplevel().sort_index()

                time_spent
name    room              
Georges room_1          12
        room_2           4
        room_3           0
Martin  room_1           3
        room_2           2
        room_3           0
like image 161
anky Avatar answered Oct 17 '22 19:10

anky