
Benefits of pandas' MultiIndex?

So I learned that I can use DataFrame.groupby without having a MultiIndex to do subsampling/cross-sections.

On the other hand, when I have a MultiIndex on a DataFrame, I still need to use DataFrame.groupby to do sub-sampling/cross-sections.

So what is a MultiIndex good for apart from the quite helpful and pretty display of the hierarchies when printing?

Asked Nov 05 '12 by K.-Michael Aye




2 Answers

Hierarchical indexing (also referred to as “multi-level” indexing) was introduced in the pandas 0.4 release.

This opens the door to some quite sophisticated data analysis and manipulation, especially for working with higher dimensional data. In essence, it enables you to effectively store and manipulate arbitrarily high dimension data in a 2-dimensional tabular structure (DataFrame), for example.

Imagine constructing a dataframe using MultiIndex like this:-

import pandas as pd
import numpy as np

arrays = [['one', 'one', 'one', 'two', 'two', 'two'],
          [1, 2, 3, 1, 2, 3]]

df = pd.DataFrame(np.random.randn(6, 2),
                  index=pd.MultiIndex.from_tuples(list(zip(*arrays))),
                  columns=['A', 'B'])

df

# This is the dataframe we have generated
              A         B
one 1 -0.732470 -0.313871
    2 -0.031109 -2.068794
    3  1.520652  0.471764
two 1 -0.101713 -1.204458
    2  0.958008 -0.455419
    3 -0.191702 -0.915983

This df is simply a two-dimensional data structure:

df.ndim

2

But, looking at the output, we can think of it as a 3-dimensional data structure:

  • "one" with 1 holds the data -0.732470 -0.313871.
  • "one" with 2 holds the data -0.031109 -2.068794.
  • "one" with 3 holds the data 1.520652 0.471764.

A.k.a.: "effectively store and manipulate arbitrarily high dimension data in a 2-dimensional tabular structure"
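To make that concrete, unstacking the inner index level pivots it out into the columns, giving a wide 2-D view of the same data, and stack() reverses it. A minimal sketch using the df built above (standard pandas only; the numbers are the ones shown in the generated output):

# Pivot the inner index level (1, 2, 3) out into the columns.
wide = df.unstack()

wide
#             A                             B
#             1         2         3         1         2         3
# one -0.732470 -0.031109  1.520652 -0.313871 -2.068794  0.471764
# two -0.101713  0.958008 -0.191702 -1.204458 -0.455419 -0.915983

# stack() is the inverse and restores the original two-level index.
df_again = wide.stack()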

This is not just a "pretty display". It has the benefit of easy retrieval of data since we now have a hierarchical index.

For example.

In [44]: df.loc["one"]
Out[44]:
          A         B
1 -0.732470 -0.313871
2 -0.031109 -2.068794
3  1.520652  0.471764

will give us a new data frame only for the group of data belonging to "one".

And we can narrow down our data selection further by doing this:-

In [45]: df.loc["one"].loc[1]
Out[45]:
A   -0.732470
B   -0.313871
Name: 1

And of course, if we want a specific value, here's an example:-

In [46]: df.loc["one"].loc[1]["A"]
Out[46]: -0.73247029752040727

So if we have even more index levels (besides the 2 levels shown in the example above), we can essentially drill down and select the data set we are really interested in without needing groupby.
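The chained .loc calls above can also be collapsed into a single lookup by passing the full key as a tuple, or generalised with pd.IndexSlice. These are standard pandas idioms rather than part of the original answer:

# Select one row by giving the complete index key as a tuple.
df.loc[("one", 1)]
# A   -0.732470
# B   -0.313871
# Name: (one, 1)

# Select a single value directly.
df.loc[("one", 1), "A"]
# -0.73247029752040727

# pd.IndexSlice slices on any level, e.g. rows labelled 2-3 in every outer group.
idx = pd.IndexSlice
df.loc[idx[:, 2:3], :]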

We can even grab a cross-section (either rows or columns) from our dataframe...

By rows:-

In [47]: df.xs('one')
Out[47]:
          A         B
1 -0.732470 -0.313871
2 -0.031109 -2.068794
3  1.520652  0.471764

By columns:-

In [48]: df.xs('B', axis=1)
Out[48]:
one  1   -0.313871
     2   -2.068794
     3    0.471764
two  1   -1.204458
     2   -0.455419
     3   -0.915983
Name: B
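xs also accepts a level argument, so we can cut across the inner level instead of the outer one (again plain pandas, not shown in the answer above):

# Cross-section on the second index level: the row labelled 1 in every group.
df.xs(1, level=1)
#             A         B
# one -0.732470 -0.313871
# two -0.101713 -1.204458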
Answered Oct 09 '22 by Calvin Cheng


Great post by @Calvin Cheng, but thought I'd take a stab at this as well.

When to use a MultiIndex:

  1. When a single column’s value isn’t enough to uniquely identify a row.
  2. When data is logically hierarchical - meaning that it has multiple dimensions or “levels.”

Why (your core question) - at least these are the biggest benefits IMO:

  1. Easy manipulation via stack() and unstack()
  2. Easy math when there are multiple column levels
  3. Syntactic sugar for slicing/filtering

Example:

                                                    Dollars  Units
Date       Store   Category Subcategory UPC EAN
2018-07-10 Store 1 Alcohol  Liqour      80480280024  154.77      7
           Store 2 Alcohol  Liqour      80480280024   82.08      4
           Store 3 Alcohol  Liqour      80480280024  259.38      9
           Store 1 Alcohol  Liquor      80432400630  477.68     14
                                        674545000001 139.68      4
           Store 2 Alcohol  Liquor      80432400630  203.88      6
                                        674545000001 377.13     13
           Store 3 Alcohol  Liquor      80432400630  239.19      7
                                        674545000001 432.32     14
           Store 1 Beer     Ales        94922755711   65.17      7
                                        702770082018 174.44     14
                                        736920111112  50.70      5
           Store 2 Beer     Ales        94922755711  129.60     12
                                        702770082018 107.40     10
                                        736920111112  59.65      5
           Store 3 Beer     Ales        94922755711  154.00     14
                                        702770082018 137.40     10
                                        736920111112 107.88     12
           Store 1 Beer     Lagers      702770081011 156.24     12
           Store 2 Beer     Lagers      702770081011 137.06     11
           Store 3 Beer     Lagers      702770081011 119.52      8
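For reference, a frame with this shape is typically built by calling set_index on a flat table. The sketch below assumes a hypothetical flat DataFrame named raw with the columns shown in the display above (including a single 'UPC EAN' column):

# raw: hypothetical flat DataFrame with columns
# Date, Store, Category, Subcategory, UPC EAN, Dollars, Units
df = raw.set_index(['Date', 'Store', 'Category', 'Subcategory', 'UPC EAN'])

# Keeping the index lexsorted makes .xs/.loc slicing fast and warning-free.
df = df.sort_index()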

1) If we want to easily compare sales across stores, we can use df.unstack('Store') to line everything up side-by-side:

                                             Dollars                 Units
Store                                        Store 1 Store 2 Store 3 Store 1 Store 2 Store 3
Date       Category Subcategory UPC EAN
2018-07-10 Alcohol  Liqour      80480280024   154.77   82.08  259.38       7       4       9
                    Liquor      80432400630   477.68  203.88  239.19      14       6       7
                                674545000001  139.68  377.13  432.32       4      13      14
           Beer     Ales        94922755711    65.17  129.60  154.00       7      12      14
                                702770082018  174.44  107.40  137.40      14      10      10
                                736920111112   50.70   59.65  107.88       5       5      12
                    Lagers      702770081011  156.24  137.06  119.52      12      11       8

2) We can also easily do math on multiple columns. For example, df['Dollars'] / df['Units'] will then divide each store's dollars by its units, for every store without multiple operations:

Store                                        Store 1  Store 2  Store 3
Date       Category Subcategory UPC EAN
2018-07-10 Alcohol  Liqour      80480280024    22.11    20.52    28.82
                    Liquor      80432400630    34.12    33.98    34.17
                                674545000001   34.92    29.01    30.88
           Beer     Ales        94922755711     9.31    10.80    11.00
                                702770082018   12.46    10.74    13.74
                                736920111112   10.14    11.93     8.99
                    Lagers      702770081011   13.02    12.46    14.94

3) If we then want to filter to just specific rows, instead of using the

df[(df[col1] == val1) & (df[col2] == val2) & (df[col3] == val3)]

format, we can instead use .xs or .query (yes, these work for regular dfs too, but they're not very useful there). The syntax would instead be:

df.xs((val1, val2, val3), level=(col1, col2, col3)) 
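Applied to the example frame above, that could look like the following sketch (the level and column names are taken from the display; which rows come back of course depends on the data):

# All rows for Store 1, dropping the Store level from the result.
df.xs('Store 1', level='Store')

# Same selection, but keep the Store level in the result.
df.xs('Store 1', level='Store', drop_level=False)

# query() can reference index level names directly.
df.query("Category == 'Beer' and Subcategory == 'Ales'")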

More examples can be found in this tutorial notebook I put together.

Answered Oct 09 '22 by ZaxR