What is a MultiIndex in a pandas DataFrame and groupby?

In this article, we will discuss the MultiIndex for pandas DataFrames and groupby operations. A MultiIndex lets an axis carry more than one level of labels, so each row or column is identified by a combination of keys rather than a single one. It is a multi-level, or hierarchical, index object for pandas objects.
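As a minimal sketch of how a groupby leads to a MultiIndex, the snippet below groups a small frame by two keys. The sales data and its column names are invented for illustration and do not come from the article.

```python
import pandas as pd

# Hypothetical sales records, invented for illustration
sales = pd.DataFrame({
    "region": ["east", "east", "west", "west", "west"],
    "product": ["apple", "pear", "apple", "apple", "pear"],
    "units": [10, 4, 7, 3, 5],
})

# Grouping by two keys produces a result indexed by a two-level MultiIndex
totals = sales.groupby(["region", "product"])["units"].sum()

print(type(totals.index).__name__)     # MultiIndex
print(list(totals.index.names))        # ['region', 'product']
print(totals.loc[("west", "apple")])   # 10
```

Each entry of the result is addressed by a (region, product) pair, which is the hierarchical behaviour described above.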

How to select data from a DataFrame in pandas?

When it comes to selecting data from a DataFrame, pandas loc is one of the top favorites. In a previous article, we introduced loc and iloc for selecting data in a general (single-index) DataFrame. Accessing data in a MultiIndex DataFrame can be done in a similar way to a single-index DataFrame, except that a key can be a tuple with one label per level. We can also use : to return all data.
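The following sketch shows loc on a two-level index. The city and year labels and the temperature values are hypothetical, chosen only to keep the example small.

```python
import pandas as pd

# Hypothetical two-level index (city, year), invented for illustration
index = pd.MultiIndex.from_product(
    [["Paris", "Rome"], [2020, 2021]], names=["city", "year"]
)
temps = pd.DataFrame({"temp": [12.5, 13.1, 16.0, 16.4]}, index=index)

# A single label selects on the first level and drops that level
paris = temps.loc["Paris"]

# A tuple supplies one label per level
rome_2021 = temps.loc[("Rome", 2021), "temp"]

# A bare colon returns every row, just as with a single index
everything = temps.loc[:, "temp"]

print(paris.index.tolist())  # [2020, 2021]
print(rome_2021)             # 16.4
print(len(everything))       # 4
```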

What is a multi-level index in pandas?

A MultiIndex is a multi-level, or hierarchical, index object for pandas objects. It is described by four things: levels, the unique labels for each level; codes, integers for each level designating which label sits at each location; sortorder, the level of sortedness (the index must be lexicographically sorted by that level); and names, the names for each of the index levels (name is accepted for compatibility).
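To make the pieces concrete, here is a small sketch that builds a MultiIndex from tuples and inspects its levels, codes and names attributes. The labels are arbitrary.

```python
import pandas as pd

mi = pd.MultiIndex.from_tuples(
    [("a", 1), ("a", 2), ("b", 1)], names=["letter", "number"]
)

# Unique labels per level
levels = [level.tolist() for level in mi.levels]
# Integer positions into each level, one per row
codes = [code.tolist() for code in mi.codes]
names = list(mi.names)

print(levels)  # [['a', 'b'], [1, 2]]
print(codes)   # [[0, 0, 1], [0, 1, 0]]
print(names)   # ['letter', 'number']
```

Reading the codes column by column recovers the original tuples: position (0, 0) is ('a', 1), (0, 1) is ('a', 2), and (1, 0) is ('b', 1).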

How do I index a DataFrame with a MultiIndex?

When creating a DataFrame with a MultiIndex, make sure the index is sorted as part of the line of code that creates it. The pandas documentation has this note on it: Indexing will work even if the data are not sorted, but will be rather inefficient (and show a PerformanceWarning). It will also return a copy of the data rather than a view.
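A minimal sketch of sorting at creation time follows. It assumes that sort_index() is the call chained onto the end of the set_index line; the column names film, chapter and words are hypothetical.

```python
import pandas as pd

# Hypothetical data deliberately built in unsorted order
raw = pd.DataFrame({
    "film": ["B", "A", "B", "A"],
    "chapter": [2, 1, 1, 2],
    "words": [40, 10, 30, 20],
})

unsorted = raw.set_index(["film", "chapter"])
# Assumption: sort_index() is chained at the end of the creation line
multi = raw.set_index(["film", "chapter"]).sort_index()

print(unsorted.index.is_monotonic_increasing)  # False
print(multi.index.is_monotonic_increasing)     # True

# Label slicing across the first level works on the sorted index
print(multi.loc["A":"B", "words"].tolist())    # [10, 20, 30, 40]
```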

selecting from multi-index pandas

Tags: python, pandas, dataframe, multi-index

One way is to use the get_level_values Index method:

In [11]: df
Out[11]:
     0
A B
1 4  1
2 5  2
3 6  3

In [12]: df.iloc[df.index.get_level_values('A') == 1]
Out[12]:
     0
A B
1 4  1

Since pandas 0.13 you can also use xs with the drop_level argument:

df.xs(1, level='A', drop_level=False) # axis=1 if columns

Note: if this were a column MultiIndex rather than a row index, you could use the same technique:

In [21]: df1 = df.T

In [22]: df1.iloc[:, df1.columns.get_level_values('A') == 1]
Out[22]:
A  1
B  4
0  1
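For readers who want to run this, the sketch below rebuilds the small frame shown above and compares the mask approach with xs. Using loc with the mask instead of iloc is a substitution made here; both accept a boolean array.

```python
import pandas as pd

# Rebuild the small frame shown above: index levels A and B, one column named 0
df = pd.DataFrame(
    {0: [1, 2, 3]},
    index=pd.MultiIndex.from_arrays([[1, 2, 3], [4, 5, 6]], names=["A", "B"]),
)

# Boolean mask built from one index level
mask = df.index.get_level_values("A") == 1
by_mask = df.loc[mask]

# xs with drop_level=False keeps both levels in the result
by_xs = df.xs(1, level="A", drop_level=False)

print(by_mask.index.tolist())  # [(1, 4)]
print(by_xs.index.tolist())    # [(1, 4)]
```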

You can also use query which is very readable in my opinion and straightforward to use:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [10, 20, 50, 80], 'C': [6, 7, 8, 9]})
df = df.set_index(['A', 'B'])

      C
A B    
1 10  6
2 20  7
3 50  8
4 80  9

For what you had in mind you can now simply do:

df.query('A == 1')

      C
A B    
1 10  6

You can also have more complex queries using and

df.query('A >= 1 and B >= 50')

      C
A B    
3 50  8
4 80  9

and or

df.query('A == 1 or B >= 50')

      C
A B    
1 10  6
3 50  8
4 80  9

You can also mix index levels and regular columns in one query (here A is an index level and C is a column), e.g.

df.query('A == 1 or C >= 8')

will return

      C
A B    
1 10  6
3 50  8
4 80  9

If you want to use variables inside your query, you can use @:

b_threshold = 20
c_threshold = 8

df.query('B >= @b_threshold and C <= @c_threshold')

      C
A B    
2 20  7
3 50  8
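As a rough cross-check, the same filter can be written without query by combining get_level_values with an ordinary column comparison. This is a sketch of an equivalent form, not something the answer itself proposes; the thresholds are written as literals for simplicity.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [10, 20, 50, 80], "C": [6, 7, 8, 9]})
df = df.set_index(["A", "B"])

# query resolves B as an index level name and C as a column
via_query = df.query("B >= 20 and C <= 8")

# The same filter written as an explicit boolean mask
mask = (df["C"] <= 8) & (df.index.get_level_values("B") >= 20)
via_mask = df.loc[mask]

print(via_query.index.tolist())  # [(2, 20), (3, 50)]
print(via_mask.index.tolist())   # [(2, 20), (3, 50)]
```

The query form is shorter; the mask form makes it explicit which names are index levels and which are columns.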

You can use DataFrame.xs():

In [36]: df = pd.DataFrame(np.random.randn(10, 4))  # assumes import numpy as np and import pandas as pd

In [37]: df.columns = [np.random.choice(['a', 'b'], size=4).tolist(), np.random.choice(['c', 'd'], size=4)]

In [38]: df.columns.names = ['A', 'B']

In [39]: df
Out[39]:
A      b             a
B      d      d      d      d
0 -1.406  0.548 -0.635  0.576
1 -0.212 -0.583  1.012 -1.377
2  0.951 -0.349 -0.477 -1.230
3  0.451 -0.168  0.949  0.545
4 -0.362 -0.855  1.676 -2.881
5  1.283  1.027  0.085 -1.282
6  0.583 -1.406  0.327 -0.146
7 -0.518 -0.480  0.139  0.851
8 -0.030 -0.630 -1.534  0.534
9  0.246 -1.558 -1.885 -1.543

In [40]: df.xs('a', level='A', axis=1)
Out[40]:
B      d      d
0 -0.635  0.576
1  1.012 -1.377
2 -0.477 -1.230
3  0.949  0.545
4  1.676 -2.881
5  0.085 -1.282
6  0.327 -0.146
7  0.139  0.851
8 -1.534  0.534
9 -1.885 -1.543

If you want to keep the A level (the drop_level keyword argument is only available starting from v0.13.0):

In [42]: df.xs('a', level='A', axis=1, drop_level=False)
Out[42]:
A      a
B      d      d
0 -0.635  0.576
1  1.012 -1.377
2 -0.477 -1.230
3  0.949  0.545
4  1.676 -2.881
5  0.085 -1.282
6  0.327 -0.146
7  0.139  0.851
8 -1.534  0.534
9 -1.885 -1.543
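Because the session above uses random labels and values, its output cannot be reproduced exactly. The sketch below uses fixed column labels (chosen here for illustration) to show the effect of drop_level.

```python
import numpy as np
import pandas as pd

# Fixed column labels instead of random ones, so the result is reproducible
columns = pd.MultiIndex.from_tuples(
    [("b", "d"), ("b", "c"), ("a", "d"), ("a", "c")], names=["A", "B"]
)
df = pd.DataFrame(np.arange(8).reshape(2, 4), columns=columns)

dropped = df.xs("a", level="A", axis=1)                 # level A removed
kept = df.xs("a", level="A", axis=1, drop_level=False)  # level A retained

print(dropped.columns.tolist())  # ['d', 'c']
print(kept.columns.tolist())     # [('a', 'd'), ('a', 'c')]
```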

Understanding how to access a multi-indexed pandas DataFrame can help you with all kinds of tasks like that.

Copy and paste this into your code to generate an example:

import numpy as np
import pandas as pd

# hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])

# mock some data
data = np.round(np.random.randn(4, 6), 1)
data[:, ::2] *= 10
data += 37

# create the DataFrame
health_data = pd.DataFrame(data, index=index, columns=columns)
health_data

This gives you a table with (year, visit) as the row index and (subject, type) as the column index.

Standard access by column

health_data['Bob']
type          HR  Temp
year visit            
2013 1      22.0  38.6
     2      52.0  38.3
2014 1      30.0  38.9
     2      31.0  37.3


health_data['Bob']['HR']
year  visit
2013  1        22.0
      2        52.0
2014  1        30.0
      2        31.0
Name: HR, dtype: float64

# filtering by column/subcolumn - your case:
health_data['Bob']['HR']==22
year  visit
2013  1         True
      2        False
2014  1        False
      2        False

health_data['Bob']['HR'][2013]    
visit
1    22.0
2    52.0
Name: HR, dtype: float64

health_data['Bob']['HR'][2013][1]
22.0

Access by row

health_data.loc[2013]
subject   Bob       Guido         Sue      
type       HR  Temp    HR  Temp    HR  Temp
visit                                      
1        22.0  38.6  40.0  38.9  53.0  37.5
2        52.0  38.3  42.0  34.6  30.0  37.7

health_data.loc[2013,1] 
subject  type
Bob      HR      22.0
         Temp    38.6
Guido    HR      40.0
         Temp    38.9
Sue      HR      53.0
         Temp    37.5
Name: (2013, 1), dtype: float64

health_data.loc[2013,1]['Bob']
type
HR      22.0
Temp    38.6
Name: (2013, 1), dtype: float64

health_data.loc[2013,1]['Bob']['HR']
22.0
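The chained lookups above can also be collapsed into one loc call that takes a row tuple and a column tuple. The sketch below uses deterministic placeholder values (0.0 to 23.0) instead of the random data, so the numbers differ from the outputs shown.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]], names=["year", "visit"])
columns = pd.MultiIndex.from_product(
    [["Bob", "Guido", "Sue"], ["HR", "Temp"]], names=["subject", "type"]
)
# Deterministic placeholder values, not the random data used above
health_data = pd.DataFrame(
    np.arange(24, dtype=float).reshape(4, 6), index=index, columns=columns
)

# Row key and column key are both tuples, resolved in one loc call
value = health_data.loc[(2014, 2), ("Sue", "Temp")]

# Chained lookup, as in the examples above, reaches the same cell
chained = health_data["Sue"]["Temp"][2014][2]

print(value, chained)  # 23.0 23.0
```

The single loc call is generally the safer choice when assigning, since chained indexing may operate on an intermediate copy.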

Slicing multi-index

idx = pd.IndexSlice
health_data.loc[idx[:, 1], idx[:, 'HR']]
subject      Bob Guido   Sue
type          HR    HR    HR
year visit                  
2013 1      22.0  40.0  53.0
2014 1      30.0  52.0  45.0
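Here is a reproducible sketch of the same IndexSlice selection. It again swaps in placeholder values (0.0 to 23.0) for the random data, so only the shape and labels match the output above.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]], names=["year", "visit"])
columns = pd.MultiIndex.from_product(
    [["Bob", "Guido", "Sue"], ["HR", "Temp"]], names=["subject", "type"]
)
# Deterministic placeholder values, not the random data used above
health_data = pd.DataFrame(
    np.arange(24, dtype=float).reshape(4, 6), index=index, columns=columns
)

idx = pd.IndexSlice

# Rows: every year, visit 1 only. Columns: every subject, HR only.
first_visit_hr = health_data.loc[idx[:, 1], idx[:, "HR"]]

print(first_visit_hr.index.tolist())    # [(2013, 1), (2014, 1)]
print(first_visit_hr.columns.tolist())  # [('Bob', 'HR'), ('Guido', 'HR'), ('Sue', 'HR')]
```

Both axes here come from from_product, so they are already lexicographically sorted, which this kind of slicing relies on.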

You can use DataFrame.loc:

>>> df.loc[1]

Example

>>> print(df)
       result
A B C        
1 1 1       6
    2       9
  2 1       8
    2      11
2 1 1       7
    2      10
  2 1       9
    2      12

>>> print(df.loc[1])
     result
B C        
1 1       6
  2       9
2 1       8
  2      11

>>> print(df.loc[2, 1])
   result
C        
1       7
2      10
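The frame printed above is not constructed in the answer, so the sketch below rebuilds it from the displayed values and repeats the lookups, adding a full three-level key as a further case.

```python
import pandas as pd

# Rebuild the three-level frame printed above
index = pd.MultiIndex.from_product([[1, 2], [1, 2], [1, 2]], names=["A", "B", "C"])
df = pd.DataFrame({"result": [6, 9, 8, 11, 7, 10, 9, 12]}, index=index)

level_a = df.loc[1]                   # all rows with A == 1; level A is dropped
level_ab = df.loc[(2, 1)]             # A == 2 and B == 1; levels A and B are dropped
single = df.loc[(2, 1, 2), "result"]  # full key plus column label gives a scalar

print(level_a["result"].tolist())   # [6, 9, 8, 11]
print(level_ab["result"].tolist())  # [7, 10]
print(single)                       # 10
```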
