Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to update a subset of a MultiIndexed pandas DataFrame

Tags:

python

pandas

I'm using a MultiIndexed pandas DataFrame and would like to multiply a subset of the DataFrame by a certain number.

It's the same as this but with a MultiIndex.

>>> d = pd.DataFrame({'year':[2008,2008,2008,2008,2009,2009,2009,2009], 
                      'flavour':['strawberry','strawberry','banana','banana',
                      'strawberry','strawberry','banana','banana'],
                      'day':['sat','sun','sat','sun','sat','sun','sat','sun'],
                      'sales':[10,12,22,23,11,13,23,24]})

>>> d = d.set_index(['year','flavour','day'])                  

>>> d
                     sales
year flavour    day       
2008 strawberry sat     10
                sun     12
     banana     sat     22
                sun     23
2009 strawberry sat     11
                sun     13
     banana     sat     23
                sun     24

So far, so good. But let's say I spot that all the Saturday figures are only half what they should be! I'd like to multiply all sat sales by 2.

My first attempt at this was:

sat = d.xs('sat', level='day')
sat = sat * 2
d.update(sat)

but this doesn't work because the variable sat has lost the day level of the index:

>>> sat
                 sales
year flavour          
2008 strawberry     20
     banana         44
2009 strawberry     22
     banana         46

so pandas doesn't know how to join the new sales figures back onto the old dataframe.

I had a quick stab at:

>>> sat = d.xs('sat', level='day', copy=False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 2248, in xs
    raise ValueError('Cannot retrieve view (copy=False)')
ValueError: Cannot retrieve view (copy=False)

I have no idea what that error means, but I feel like I'm making a mountain out of a molehill. Does anyone know the right way to do this?

Thanks in advance, Rob

like image 238
LondonRob Avatar asked Jul 09 '13 16:07

LondonRob


2 Answers

Note: In soon to be released 0.13 a drop_level argument has been added to xs (thanks to this question!):

In [42]: df.xs('sat', level='day', drop_level=False)
Out[42]:
                     sales
year flavour    day
2008 strawberry sat     10

Another option is to use select (which extracts a sub-DataFrame (copy) of the same data, i.e. it has the same index and so can be updated correctly):

In [11]: d.select(lambda x: x[2] == 'sat') * 2
Out[11]:
                     sales
year flavour    day
2008 strawberry sat     20
     banana     sat     44
2009 strawberry sat     22
     banana     sat     46

In [12]: d.update(d.select(lambda x: x[2] == 'sat') * 2)

Another option is to use an apply:

In [21]: d.apply(lambda x: x*2 if x.name[2] == 'sat' else x, axis=1)

Another option is to use get_level_values (this is probably the most efficient way of these):

In [22]: d[d.index.get_level_values('day') == 'sat'] *= 2

Another option is promote the 'day' level to a column and then use an apply.

like image 66
Andy Hayden Avatar answered Nov 09 '22 07:11

Andy Hayden


Detailed MultiIndexing Explanation

You can use the .loc indexer to select subsets of data from a DataFrame with a MultiIndex. Assuming we have the data from the original question:

                     sales
year flavour    day       
2008 strawberry sat     10
                sun     12
     banana     sat     22
                sun     23
2009 strawberry sat     11
                sun     13
     banana     sat     23
                sun     24

This DataFrame has 3 levels in its index and each level has a name (year, flavour and day). The levels are also implicitly given integer locations beginning with 0 from the outside. So, the year level can be referenced as 0, flavour with 1, and day as 2.

Selecting from level 0 - the outermost level

Level 0 is the easiest level to make a selection with. For instance, if we wanted to select just the year 2008, we could do the following:

df.loc[2008]

                sales
flavour    day       
strawberry sat     10
           sun     12
banana     sat     22
           sun     23

This drops the outer index level. If you wanted to keep the outer level, you could pass your selection as a list (or a slice):

df.loc[[2008]]  # df.loc[2008:2008] gets the same result

                     sales
year flavour    day       
2008 strawberry sat     10
                sun     12
     banana     sat     22
                sun     23

Making selections from the other levels

Making selections from any level other than level 0 is more complicated. Let's begin by selecting a specific combination like the year 2008, banana and sat. To do this, you pass the combination as a tuple to .loc:

df.loc[(2008, 'banana', 'sat')]

sales    22
Name: (2008, banana, sat), dtype: int64

I always use parentheses like the above but Python will automatically interpret any comma-separated set of values as tuple so the following will get the same result:

df.loc[2008, 'banana', 'sat']

All levels were dropped and a Series returned. We can keep the levels by passing the tuple inside of a list:

df.loc[[(2008, 'banana', 'sat')]]

                  sales
year flavour day       
2008 banana  sat     22

Selecting multiple values from a particular level

The previous example made a single selection from each level. It's possible to use a list to contain all the values of the level you desire. For instance, if we wanted to select all rows with year 2008 and 2009, with banana flavour and on saturday and sunday, we could do the following:

df.loc[([2008, 2009], 'banana', ('sat','sun'))]

                  sales
year flavour day       
2008 banana  sat     22
             sun     23
2009 banana  sat     23
             sun     24

Again, you don't have to wrap the whole selection in paraentheses to denote a tuple and can simply do:

df.loc[[2008, 2009], 'banana', ('sat','sun')]

Selecting all values from a particular level.

You may instead want to select all values from a particular level. For instance, let's try to select all the years, all the flavours and just saturday. You might think the following would work:

df.loc[:, :, 'sat']

But, this is met with a 'too many indexer's IndexError. There are three different ways to select all values from a particular level.

  • df.loc[(slice(None), slice(None), 'sat'), :]
  • df.loc(axis=0)[:, :, 'sat']
  • df.loc[pd.IndexSlice[:, :, 'sat'], :]

All three yield the following:

                     sales
year flavour    day       
2008 strawberry sat     10
     banana     sat     22
2009 strawberry sat     11
     banana     sat     23
like image 9
Ted Petrou Avatar answered Nov 09 '22 09:11

Ted Petrou