Pandas how does IndexSlice work

Tags:

python

pandas

I am following this tutorial: GitHub Link

If you scroll down (Ctrl+F: Exercise: Select the most-reviewd beers ) to the section that says Exercise: Select the most-reviewd beers:

The dataframe is multi-indexed.

To select the most-reviewed beers:

top_beers = df['beer_id'].value_counts().head(10).index
reviews.loc[pd.IndexSlice[:, top_beers], ['beer_name', 'beer_style']]

My question is about how IndexSlice is used: how come you can skip the trailing colon after top_beers and the code still runs? I would have expected to need:

reviews.loc[pd.IndexSlice[:, top_beers, :], ['beer_name', 'beer_style']] 

There are three index levels: profile_name, beer_id and time. Why does pd.IndexSlice[:, top_beers] work without specifying what to do with the time level?

Cheng asked May 20 '17 15:05


1 Answer

To complement the previous answer, let me explain how pd.IndexSlice works and why it is useful.

Well, there is not much to say about its implementation. As you read in the source, it just does the following:

class IndexSlice(object):
    def __getitem__(self, arg):
        return arg

From this we see that pd.IndexSlice only forwards the arguments that __getitem__ has received. Looks pretty stupid, doesn't it? However, it actually does something.

As you certainly know already, obj.__getitem__(arg) is called if you access an object obj through its bracket operator obj[arg]. For sequence-type objects, arg can be either an integer or a slice object. We rarely construct slices ourselves. Rather, we'd use the slice notation with colons for this purpose, e.g. obj[0:5].

And here comes the point. The Python interpreter converts this colon notation into slice objects before calling the object's __getitem__(arg) method. Therefore, the return value of IndexSlice.__getitem__() will actually be a slice, the object itself unchanged (if no : was used, e.g. an integer or a list), or a tuple of these (if multiple arguments are passed). In summary, the only purpose of IndexSlice is that we don't have to construct the slices on our own. This behavior is particularly useful for pd.DataFrame.loc.

Let's first have a look at the following examples:

import pandas as pd
idx = pd.IndexSlice
print(idx[0])               # 0
print(idx[0,'a'])           # (0, 'a')
print(idx[:])               # slice(None, None, None)
print(idx[0:3])             # slice(0, 3, None)
print(idx[0.1:2.3])         # slice(0.1, 2.3, None)
print(idx[0:3,'a':'c'])     # (slice(0, 3, None), slice('a', 'c', None))

We observe that every use of a colon : is converted into a slice object. If multiple arguments are passed to the index operator, the arguments are turned into an n-tuple.
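To make the forwarding explicit, here is a minimal sketch that compares pd.IndexSlice with a hand-written stand-in. The class name Passthrough is made up for this illustration and is not part of pandas; it only mirrors the two-line implementation quoted above.

```python
import pandas as pd


class Passthrough:
    """Hypothetical stand-in: returns whatever the bracket operator received."""

    def __getitem__(self, arg):
        return arg


p = Passthrough()
idx = pd.IndexSlice

# A single slice expression arrives as one slice object.
same_single = p[0:3] == idx[0:3] == slice(0, 3, None)

# Several comma-separated expressions arrive as one tuple of slices.
same_tuple = p[0:3, 'a':'c'] == idx[0:3, 'a':'c'] == (slice(0, 3), slice('a', 'c'))

# A bare colon is just slice(None).
bare_colon = idx[:] == slice(None)

print(same_single, same_tuple, bare_colon)
```

Python itself builds the slice objects; neither object adds any logic of its own.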

To demonstrate how this could be useful for a pandas data-frame df with a multi-level index, let's have a look at the following.

# A sample table with three-level row-index
# and single-level column index.
import numpy as np
level0 = range(0,10)
level1 = list('abcdef')
level2 = ['I', 'II', 'III', 'IV']
mi = pd.MultiIndex.from_product([level0, level1, level2])
df = pd.DataFrame(np.random.random([len(mi),2]), 
                  index=mi, columns=['col1', 'col2'])

# Select column 'col1' for all rows.
df.loc[:,'col1']            # pd.Series         

# Note: in the above example, the returned value has type
# pd.Series, because only one column is returned. One can 
# enforce the returned object to be a data-frame:
df.loc[:,['col1']]          # pd.DataFrame, or
df.loc[:,'col1'].to_frame() # pd.DataFrame

# Select all rows with top-level values 0 to 3. Note that
# label-based slices in .loc include both endpoints.
df.loc[0:3, 'col1']   

# If we want to create a slice for multiple index levels
# we need to pass somehow a list of slices. The following
# however leads to a SyntaxError because the slice 
# notation ':' cannot be used inside a list literal
# (the line is commented out so the rest can run).
# df.loc[[0:3, 'a':'c'], 'col1'] 

# The following is valid python code, but looks clumsy:
df.loc[(slice(0, 3, None), slice('a', 'c', None)), 'col1']

# Here is why pd.IndexSlice is useful. It helps
# to create a slice that makes use of two index-levels.
df.loc[idx[0:3, 'a':'c'], 'col1'] 

# We can expand the slice specification by a third level.
df.loc[idx[0:3, 'a':'c', 'I':'III'], 'col1'] 

# A solitary slicing operator ':' means: take them all.
# It is equivalent to slice(None).
df.loc[idx[0:3, 'a':'c', :], 'col1'] # pd.Series

# Semantically, this is equivalent to the following,
# because the last ':' in the previous example does 
# not add any information about the slice specification.
df.loc[idx[0:3, 'a':'c'], 'col1']    # pd.Series

# The following lines are also equivalent, but
# both expressions evaluate to a result with multiple columns.
df.loc[idx[0:3, 'a':'c', :], :]    # pd.DataFrame
df.loc[idx[0:3, 'a':'c'], :]       # pd.DataFrame

In summary, pd.IndexSlice helps to improve readability when specifying slices for row and column indices.

What pandas then does with these slices is a different story. It essentially selects rows/columns, starting from the topmost index-level and reduces the selection when going further down the levels, depending on how many levels have been specified. pd.DataFrame.loc is an object with its own __getitem__() function that does all this.
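This is also what answers the original question: a level that is left out at the end of the key is treated like a bare colon. The following is a small sketch of that equivalence on a toy frame. The data and the name wanted are made up for illustration, and a plain list stands in for top_beers.

```python
import numpy as np
import pandas as pd

idx = pd.IndexSlice

# Toy frame with a three-level row index (deterministic values).
mi = pd.MultiIndex.from_product(
    [range(10), list('abcdef'), ['I', 'II', 'III', 'IV']]
)
df = pd.DataFrame(
    np.arange(len(mi) * 2).reshape(len(mi), 2),
    index=mi, columns=['col1', 'col2'],
)

wanted = ['a', 'b']  # stand-in for top_beers

# Third level omitted vs. spelled out with ':' vs. slices written by hand.
short = df.loc[idx[:, wanted], :]
full = df.loc[idx[:, wanted, :], :]
by_hand = df.loc[(slice(None), wanted, slice(None)), :]

print(short.equals(full), full.equals(by_hand), len(short))
```

All three selections should contain the same rows, since the missing third level places no restriction on the result.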

As you pointed out already in one of your comments, pandas seemingly behaves weirdly in some special cases. The two examples you mentioned are meant to select the same rows. However, they are treated differently by pandas internally.

# This will work.
reviews.loc[idx[top_reviewers,        99, :], ['beer_name', 'brewer_id']]
# This will fail with TypeError "unhashable type: 'Index'".
reviews.loc[idx[top_reviewers,        99]   , ['beer_name', 'brewer_id']]
# This fixes the problem. (pd.Index is not hashable, a tuple is.
# However, the problem matters only with the second expression.)
reviews.loc[idx[tuple(top_reviewers), 99]   , ['beer_name', 'brewer_id']]

Admittedly, the difference is subtle.
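The hashability difference named in the comment can be checked in isolation. This sketch only shows that a pd.Index cannot be hashed while a tuple built from it can; it does not reproduce the pandas code path that triggers the error, and the variable top is a made-up stand-in for top_reviewers.

```python
import pandas as pd

top = pd.Index(['x', 'y'])  # hypothetical stand-in for top_reviewers

# pd.Index objects are mutable-looking containers and refuse to be hashed.
try:
    hash(top)
    index_hashable = True
except TypeError as err:
    index_hashable = False
    message = str(err)

# A tuple of the same labels is hashable.
tuple_hashable = isinstance(hash(tuple(top)), int)

print(index_hashable, tuple_hashable)
```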

normanius answered Oct 31 '22 02:10