I'm using a MultiIndexed pandas DataFrame and would like to multiply a subset of the DataFrame by a certain number.
It's the same as this but with a MultiIndex.
>>> d = pd.DataFrame({'year':[2008,2008,2008,2008,2009,2009,2009,2009],
'flavour':['strawberry','strawberry','banana','banana',
'strawberry','strawberry','banana','banana'],
'day':['sat','sun','sat','sun','sat','sun','sat','sun'],
'sales':[10,12,22,23,11,13,23,24]})
>>> d = d.set_index(['year','flavour','day'])
>>> d
sales
year flavour day
2008 strawberry sat 10
sun 12
banana sat 22
sun 23
2009 strawberry sat 11
sun 13
banana sat 23
sun 24
So far, so good. But let's say I spot that all the Saturday figures are only half what they should be! I'd like to multiply all sat
sales by 2.
My first attempt at this was:
sat = d.xs('sat', level='day')
sat = sat * 2
d.update(sat)
but this doesn't work because the variable sat
has lost the day
level of the index:
>>> sat
sales
year flavour
2008 strawberry 20
banana 44
2009 strawberry 22
banana 46
so pandas doesn't know how to join the new sales figures back onto the old dataframe.
I had a quick stab at:
>>> sat = d.xs('sat', level='day', copy=False)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 2248, in xs
raise ValueError('Cannot retrieve view (copy=False)')
ValueError: Cannot retrieve view (copy=False)
I have no idea what that error means, but I feel like I'm making a mountain out of a molehill. Does anyone know the right way to do this?
Thanks in advance, Rob
Note: In soon to be released 0.13 a drop_level
argument has been added to xs (thanks to this question!):
In [42]: df.xs('sat', level='day', drop_level=False)
Out[42]:
sales
year flavour day
2008 strawberry sat 10
Another option is to use select (which extracts a sub-DataFrame (copy) of the same data, i.e. it has the same index and so can be updated correctly):
In [11]: d.select(lambda x: x[2] == 'sat') * 2
Out[11]:
sales
year flavour day
2008 strawberry sat 20
banana sat 44
2009 strawberry sat 22
banana sat 46
In [12]: d.update(d.select(lambda x: x[2] == 'sat') * 2)
Another option is to use an apply:
In [21]: d.apply(lambda x: x*2 if x.name[2] == 'sat' else x, axis=1)
Another option is to use get_level_values
(this is probably the most efficient way of these):
In [22]: d[d.index.get_level_values('day') == 'sat'] *= 2
Another option is promote the 'day' level to a column and then use an apply.
You can use the .loc
indexer to select subsets of data from a DataFrame with a MultiIndex. Assuming we have the data from the original question:
sales
year flavour day
2008 strawberry sat 10
sun 12
banana sat 22
sun 23
2009 strawberry sat 11
sun 13
banana sat 23
sun 24
This DataFrame has 3 levels in its index and each level has a name (year
, flavour
and day
). The levels are also implicitly given integer locations beginning with 0 from the outside. So, the year
level can be referenced as 0
, flavour
with 1
, and day
as 2
.
Level 0
is the easiest level to make a selection with. For instance, if we wanted to select just the year 2008, we could do the following:
df.loc[2008]
sales
flavour day
strawberry sat 10
sun 12
banana sat 22
sun 23
This drops the outer index level. If you wanted to keep the outer level, you could pass your selection as a list (or a slice):
df.loc[[2008]] # df.loc[2008:2008] gets the same result
sales
year flavour day
2008 strawberry sat 10
sun 12
banana sat 22
sun 23
Making selections from any level other than level 0 is more complicated. Let's begin by selecting a specific combination like the year 2008
, banana
and sat
. To do this, you pass the combination as a tuple to .loc
:
df.loc[(2008, 'banana', 'sat')]
sales 22
Name: (2008, banana, sat), dtype: int64
I always use parentheses like the above but Python will automatically interpret any comma-separated set of values as tuple so the following will get the same result:
df.loc[2008, 'banana', 'sat']
All levels were dropped and a Series returned. We can keep the levels by passing the tuple inside of a list:
df.loc[[(2008, 'banana', 'sat')]]
sales
year flavour day
2008 banana sat 22
The previous example made a single selection from each level. It's possible to use a list to contain all the values of the level you desire. For instance, if we wanted to select all rows with year 2008 and 2009, with banana flavour and on saturday and sunday, we could do the following:
df.loc[([2008, 2009], 'banana', ('sat','sun'))]
sales
year flavour day
2008 banana sat 22
sun 23
2009 banana sat 23
sun 24
Again, you don't have to wrap the whole selection in paraentheses to denote a tuple and can simply do:
df.loc[[2008, 2009], 'banana', ('sat','sun')]
You may instead want to select all values from a particular level. For instance, let's try to select all the years, all the flavours and just saturday. You might think the following would work:
df.loc[:, :, 'sat']
But, this is met with a 'too many indexer's IndexError. There are three different ways to select all values from a particular level.
df.loc[(slice(None), slice(None), 'sat'), :]
df.loc(axis=0)[:, :, 'sat']
df.loc[pd.IndexSlice[:, :, 'sat'], :]
All three yield the following:
sales
year flavour day
2008 strawberry sat 10
banana sat 22
2009 strawberry sat 11
banana sat 23
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With