Pandas how does IndexSlice work

Tags:

pandas

I am following this tutorial: GitHub Link

If you scroll down to the section titled "Exercise: Select the most-reviewd beers" (Ctrl+F for that exact string):

The dataframe is multi-indexed.

To select the most-reviewed beers:

top_beers = df['beer_id'].value_counts().head(10).index
reviews.loc[pd.IndexSlice[:, top_beers], ['beer_name', 'beer_style']]

My question is about how IndexSlice is used here. How come you can skip the colon after top_beers and the code still runs? That is, instead of writing:

reviews.loc[pd.IndexSlice[:, top_beers, :], ['beer_name', 'beer_style']]

There are three index levels: profile_name, beer_id and time. Why does pd.IndexSlice[:, top_beers] work without specifying what to do with the time level?

asked May 20 '17 15:05

1 Answer

To complement the previous answer, let me explain how pd.IndexSlice works and why it is useful.

Well, there is not much to say about its implementation. As you can read in the source, it just does the following:

class _IndexSlice:
    def __getitem__(self, arg):
        return arg

IndexSlice = _IndexSlice()

From this we see that pd.IndexSlice only forwards the arguments that __getitem__ has received. Looks pretty stupid, doesn't it? However, it actually does something.

As you certainly know already, obj.__getitem__(arg) is called if you access an object obj through its bracket operator obj[arg]. For sequence-type objects, arg can be either an integer or a slice object. We rarely construct slices ourselves. Rather, we'd use the colon-based slice notation for this purpose, e.g. obj[0:5]. (The colon is not the same thing as Ellipsis, which is written as three dots.)

And here comes the point. The Python interpreter converts this slice notation into slice objects before calling the object's __getitem__(arg) method. Therefore, the return value of IndexSlice.__getitem__() will actually be a slice, an integer (if no : was used), or a tuple of these (if multiple arguments are passed). In summary, the only purpose of IndexSlice is that we don't have to construct the slices on our own. This behavior is particularly useful for pd.DataFrame.loc.
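To see what the bracket operator actually hands to __getitem__, here is a minimal sketch. The class name Echo is made up for illustration and is not part of pandas; it simply returns its argument.

```python
class Echo:
    """Return whatever the bracket operator passes to __getitem__."""

    def __getitem__(self, arg):
        return arg


echo = Echo()

print(type(echo[2]).__name__)       # int
print(type(echo[1:4]).__name__)     # slice
print(type(echo[1:4, 2]).__name__)  # tuple

# A slice object carries start, stop and step as attributes.
s = echo[1:10:2]
print(s.start, s.stop, s.step)      # 1 10 2

# A hand-built slice behaves the same as the colon syntax.
letters = list("abcdefgh")
print(letters[1:4] == letters[slice(1, 4)])  # True
```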

Let's first have a look at the following examples:

import pandas as pd
idx = pd.IndexSlice
print(idx[0])               # 0
print(idx[0,'a'])           # (0, 'a')
print(idx[:])               # slice(None, None, None)
print(idx[0:3])             # slice(0, 3, None)
print(idx[0.1:2.3])         # slice(0.1, 2.3, None)
print(idx[0:3,'a':'c'])     # (slice(0, 3, None), slice('a', 'c', None))

We observe that every use of the colon : is converted into a slice object. If multiple arguments are passed to the index operator, the arguments are turned into n-tuples.
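The same can be checked with assertions rather than by reading printed output. The sketch below also shows that a list placed inside the brackets is passed through unchanged, which is the shape of key used in the question (pd.IndexSlice[:, top_beers]). The labels 'x' and 'y' are hypothetical.

```python
import pandas as pd

idx = pd.IndexSlice

# Two slices separated by a comma arrive as a tuple of slice objects.
assert idx[0:3, 'a':'c'] == (slice(0, 3), slice('a', 'c'))

# A lone colon becomes slice(None), i.e. "everything".
assert idx[:] == slice(None)

# A list is passed through untouched, next to the slice.
labels = ['x', 'y']  # hypothetical labels
key = idx[:, labels]
assert key == (slice(None), ['x', 'y'])
assert key[1] is labels

print(key)  # (slice(None, None, None), ['x', 'y'])
```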

To demonstrate how this could be useful for a pandas data-frame df with a multi-level index, let's have a look at the following.

# A sample table with three-level row-index
# and single-level column index.
import numpy as np
level0 = range(0,10)
level1 = list('abcdef')
level2 = ['I', 'II', 'III', 'IV']
mi = pd.MultiIndex.from_product([level0, level1, level2])
df = pd.DataFrame(np.random.random([len(mi),2]), 
                  index=mi, columns=['col1', 'col2'])

# Select column 'col1' for all rows.
df.loc[:,'col1']            # pd.Series

# Note: in the above example, the returned value has type
# pd.Series, because only one column is returned. One can 
# enforce the returned object to be a data-frame:
df.loc[:,['col1']]          # pd.DataFrame, or
df.loc[:,'col1'].to_frame() # pd.DataFrame

# Select all rows with top-level values 0 through 3.
# (Label-based slices in .loc include both endpoints.)
df.loc[0:3, 'col1']

# If we want to create a slice for multiple index levels
# we need to pass somehow a list of slices. The following
# however leads to a SyntaxError because the slice
# operator ':' cannot be placed inside a list declaration.
# (Commented out so that the rest of the snippet runs.)
# df.loc[[0:3, 'a':'c'], 'col1']

# The following is valid python code, but looks clumsy:
df.loc[(slice(0, 3, None), slice('a', 'c', None)), 'col1']

# Here is why pd.IndexSlice is useful. It helps
# to create a slice that makes use of two index-levels.
df.loc[idx[0:3, 'a':'c'], 'col1'] 

# We can expand the slice specification by a third level.
df.loc[idx[0:3, 'a':'c', 'I':'III'], 'col1'] 

# A solitary slicing operator ':' means: take them all.
# It is equivalent to slice(None).
df.loc[idx[0:3, 'a':'c', :], 'col1'] # pd.Series

# Semantically, this is equivalent to the following,
# because the last ':' in the previous example does 
# not add any information about the slice specification.
df.loc[idx[0:3, 'a':'c'], 'col1']    # pd.Series

# The following lines are also equivalent, but
# both expressions evaluate to a result with multiple columns.
df.loc[idx[0:3, 'a':'c', :], :]    # pd.DataFrame
df.loc[idx[0:3, 'a':'c'], :]       # pd.DataFrame

In summary, pd.IndexSlice helps to improve readability when specifying slices for row and column indices.
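The equivalences claimed in the comments above can be verified directly. This is a small sketch that rebuilds the same kind of sample table (with a seeded random generator, which is my addition) and compares the results with Series.equals.

```python
import numpy as np
import pandas as pd

idx = pd.IndexSlice
mi = pd.MultiIndex.from_product(
    [range(10), list('abcdef'), ['I', 'II', 'III', 'IV']]
)
df = pd.DataFrame(np.random.default_rng(0).random((len(mi), 2)),
                  index=mi, columns=['col1', 'col2'])

with_idx = df.loc[idx[0:3, 'a':'c'], 'col1']
with_slices = df.loc[(slice(0, 3), slice('a', 'c')), 'col1']
with_colon = df.loc[idx[0:3, 'a':'c', :], 'col1']

# IndexSlice and the hand-built tuple of slices select the same rows.
print(with_idx.equals(with_slices))  # True
# A trailing ':' adds no restriction.
print(with_idx.equals(with_colon))   # True
# 4 top-level values (0..3) * 3 letters (a..c) * 4 numerals
print(len(with_idx))                 # 48
```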

What pandas then does with these slices is a different story. It essentially selects rows/columns, starting from the topmost index level, and narrows the selection when going further down the levels, depending on how many levels have been specified. Levels that are left unspecified are not restricted at all, which is why pd.IndexSlice[:, top_beers] works without mentioning the time level. pd.DataFrame.loc is an object with its own __getitem__() function that does all this.

As you pointed out already in one of your comments, pandas seemingly behaves weirdly in some special cases. The two examples you mentioned will actually evaluate to the same result. However, they are treated differently by pandas internally.
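Counting rows makes the level-by-level narrowing visible. The sketch below uses the same kind of three-level sample table as above (the seeded random data is my addition); the last selection mirrors the pattern from the question, with a list on the second level and nothing said about the third.

```python
import numpy as np
import pandas as pd

idx = pd.IndexSlice
mi = pd.MultiIndex.from_product(
    [range(10), list('abcdef'), ['I', 'II', 'III', 'IV']]
)
df = pd.DataFrame(np.random.default_rng(0).random((len(mi), 2)),
                  index=mi, columns=['col1', 'col2'])

n_total = len(df)                                       # 10 * 6 * 4
n_one = len(df.loc[idx[0:3], :])                        # 4 * 6 * 4
n_two = len(df.loc[idx[0:3, 'a':'c'], :])               # 4 * 3 * 4
n_three = len(df.loc[idx[0:3, 'a':'c', 'I':'III'], :])  # 4 * 3 * 3
print(n_total, n_one, n_two, n_three)  # 240 96 48 36

# List on the second level, third level left unspecified:
picked = df.loc[idx[:, ['b', 'e']], :]
print(len(picked))  # 80

# All values of the unspecified third level are still present.
remaining = sorted(picked.index.get_level_values(2).unique())
print(remaining)  # ['I', 'II', 'III', 'IV']
```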

# This will work.
reviews.loc[idx[top_reviewers,        99, :], ['beer_name', 'brewer_id']]
# This will fail with TypeError "unhashable type: 'Index'".
reviews.loc[idx[top_reviewers,        99]   , ['beer_name', 'brewer_id']]
# This fixes the problem. (pd.Index is not hashable, a tuple is.
# However, the problem matters only with the second expression.)
reviews.loc[idx[tuple(top_reviewers), 99]   , ['beer_name', 'brewer_id']]

Admittedly, the difference is subtle.
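If you want to stay clear of the difference, one defensive pattern is to spell out every level and to pass a plain list instead of a pd.Index. The sketch below uses a small made-up table (the names toy and top are hypothetical). It does not try to reproduce the TypeError, which may well depend on the pandas version.

```python
import numpy as np
import pandas as pd

idx = pd.IndexSlice
mi = pd.MultiIndex.from_product([range(4), list('abc'), ['I', 'II']])
toy = pd.DataFrame(np.random.default_rng(0).random((len(mi), 2)),
                   index=mi, columns=['col1', 'col2'])

# Stand-in for something like top_reviewers: an Index of level-0 labels.
top = pd.Index([0, 2])

# Plain list for the first level, scalar for the second,
# explicit ':' for the third.
result = toy.loc[idx[list(top), 'b', :], ['col1']]

print(len(result))  # 4  (2 top-level labels * 1 letter * 2 numerals)
print(sorted(set(result.index.get_level_values(0))))  # [0, 2]
```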

answered Oct 31 '22 02:10

normanius

Question by Cheng; answer by normanius. Tags: python, pandas.