
Benefits of pandas' MultiIndex?

So I learned that I can use DataFrame.groupby without having a MultiIndex to do subsampling/cross-sections.

On the other hand, when I have a MultiIndex on a DataFrame, I still need to use DataFrame.groupby to do sub-sampling/cross-sections.

So what is a MultiIndex good for apart from the quite helpful and pretty display of the hierarchies when printing?

Asked Nov 05 '12 by K.-Michael Aye




2 Answers

Hierarchical indexing (also referred to as “multi-level” indexing) was introduced in the pandas 0.4 release.

This opens the door to some quite sophisticated data analysis and manipulation, especially for working with higher dimensional data. In essence, it enables you to effectively store and manipulate arbitrarily high dimension data in a 2-dimensional tabular structure (DataFrame), for example.

Imagine constructing a dataframe using MultiIndex like this:-

import pandas as pd
import numpy as np

arrays = [['one', 'one', 'one', 'two', 'two', 'two'],
          [1, 2, 3, 1, 2, 3]]

df = pd.DataFrame(np.random.randn(6, 2),
                  index=pd.MultiIndex.from_tuples(list(zip(*arrays))),
                  columns=['A', 'B'])

df

# This is the dataframe we have generated
              A         B
one 1 -0.732470 -0.313871
    2 -0.031109 -2.068794
    3  1.520652  0.471764
two 1 -0.101713 -1.204458
    2  0.958008 -0.455419
    3 -0.191702 -0.915983

This df is simply a two-dimensional data structure:

df.ndim

2

But, looking at the output, we can think of it as a 3-dimensional data structure:

  • "one" with 1 holds the data -0.732470 -0.313871.
  • "one" with 2 holds the data -0.031109 -2.068794.
  • "one" with 3 holds the data 1.520652 0.471764.

A.k.a.: "effectively store and manipulate arbitrarily high dimension data in a 2-dimensional tabular structure"
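To make that concrete, unstacking the inner index level pivots it out into the columns, giving a wide 2-D view of the same data, and stack() reverses it. A minimal sketch using the df built above (standard pandas only; the numbers are the ones shown in the generated output):

# Pivot the inner index level (1, 2, 3) out into the columns.
wide = df.unstack()

wide
#             A                             B
#             1         2         3         1         2         3
# one -0.732470 -0.031109  1.520652 -0.313871 -2.068794  0.471764
# two -0.101713  0.958008 -0.191702 -1.204458 -0.455419 -0.915983

# stack() is the inverse and restores the original two-level index.
df_again = wide.stack()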

This is not just a "pretty display". It has the benefit of easy retrieval of data since we now have a hierarchical index.

For example.

In [44]: df.loc["one"]
Out[44]:
          A         B
1 -0.732470 -0.313871
2 -0.031109 -2.068794
3  1.520652  0.471764

will give us a new data frame only for the group of data belonging to "one".

And we can narrow down our data selection further by doing this:-

In [45]: df.loc["one"].loc[1]
Out[45]:
A   -0.732470
B   -0.313871
Name: 1

And of course, if we want a specific value, here's an example:-

In [46]: df.loc["one"].loc[1]["A"]
Out[46]: -0.73247029752040727

So if we have even more index levels (besides the 2 levels shown in the example above), we can essentially drill down and select the data set we are really interested in without needing groupby.
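The chained .loc calls above can also be collapsed into a single lookup by passing the full key as a tuple, or generalised with pd.IndexSlice. These are standard pandas idioms rather than part of the original answer:

# Select one row by giving the complete index key as a tuple.
df.loc[("one", 1)]
# A   -0.732470
# B   -0.313871
# Name: (one, 1)

# Select a single value directly.
df.loc[("one", 1), "A"]
# -0.73247029752040727

# pd.IndexSlice slices on any level, e.g. rows labelled 2-3 in every outer group.
idx = pd.IndexSlice
df.loc[idx[:, 2:3], :]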

We can even grab a cross-section (either rows or columns) from our dataframe...

By rows:-

In [47]: df.xs('one')
Out[47]:
          A         B
1 -0.732470 -0.313871
2 -0.031109 -2.068794
3  1.520652  0.471764

By columns:-

In [48]: df.xs('B', axis=1)
Out[48]:
one  1   -0.313871
     2   -2.068794
     3    0.471764
two  1   -1.204458
     2   -0.455419
     3   -0.915983
Name: B
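xs also accepts a level argument, so we can cut across the inner level instead of the outer one (again plain pandas, not shown in the answer above):

# Cross-section on the second index level: the row labelled 1 in every group.
df.xs(1, level=1)
#             A         B
# one -0.732470 -0.313871
# two -0.101713 -1.204458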
Answered Oct 09 '22 by Calvin Cheng


Great post by @Calvin Cheng, but thought I'd take a stab at this as well.

When to use a MultiIndex:

  1. When a single column’s value isn’t enough to uniquely identify a row.
  2. When data is logically hierarchical - meaning that it has multiple dimensions or “levels.”

Why (your core question) - at least these are the biggest benefits IMO:

  1. Easy manipulation via stack() and unstack()
  2. Easy math when there are multiple column levels
  3. Syntactic sugar for slicing/filtering

Example:

                                                    Dollars  Units
Date       Store   Category Subcategory UPC EAN
2018-07-10 Store 1 Alcohol  Liqour      80480280024  154.77      7
           Store 2 Alcohol  Liqour      80480280024   82.08      4
           Store 3 Alcohol  Liqour      80480280024  259.38      9
           Store 1 Alcohol  Liquor      80432400630  477.68     14
                                        674545000001 139.68      4
           Store 2 Alcohol  Liquor      80432400630  203.88      6
                                        674545000001 377.13     13
           Store 3 Alcohol  Liquor      80432400630  239.19      7
                                        674545000001 432.32     14
           Store 1 Beer     Ales        94922755711   65.17      7
                                        702770082018 174.44     14
                                        736920111112  50.70      5
           Store 2 Beer     Ales        94922755711  129.60     12
                                        702770082018 107.40     10
                                        736920111112  59.65      5
           Store 3 Beer     Ales        94922755711  154.00     14
                                        702770082018 137.40     10
                                        736920111112 107.88     12
           Store 1 Beer     Lagers      702770081011 156.24     12
           Store 2 Beer     Lagers      702770081011 137.06     11
           Store 3 Beer     Lagers      702770081011 119.52      8
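For reference, a frame with this shape is typically built by calling set_index on a flat table. The sketch below assumes a hypothetical flat DataFrame named raw with the columns shown in the display above (including a single 'UPC EAN' column):

# raw: hypothetical flat DataFrame with columns
# Date, Store, Category, Subcategory, UPC EAN, Dollars, Units
df = raw.set_index(['Date', 'Store', 'Category', 'Subcategory', 'UPC EAN'])

# Keeping the index lexsorted makes .xs/.loc slicing fast and warning-free.
df = df.sort_index()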

1) If we want to easily compare sales across stores, we can use df.unstack('Store') to line everything up side-by-side:

                                             Dollars                 Units
Store                                        Store 1 Store 2 Store 3 Store 1 Store 2 Store 3
Date       Category Subcategory UPC EAN
2018-07-10 Alcohol  Liqour      80480280024   154.77   82.08  259.38       7       4       9
                    Liquor      80432400630   477.68  203.88  239.19      14       6       7
                                674545000001  139.68  377.13  432.32       4      13      14
           Beer     Ales        94922755711    65.17  129.60  154.00       7      12      14
                                702770082018  174.44  107.40  137.40      14      10      10
                                736920111112   50.70   59.65  107.88       5       5      12
                    Lagers      702770081011  156.24  137.06  119.52      12      11       8

2) We can also easily do math on multiple columns. For example, df['Dollars'] / df['Units'] will then divide each store's dollars by its units, for every store without multiple operations:

Store                                        Store 1  Store 2  Store 3
Date       Category Subcategory UPC EAN
2018-07-10 Alcohol  Liqour      80480280024    22.11    20.52    28.82
                    Liquor      80432400630    34.12    33.98    34.17
                                674545000001   34.92    29.01    30.88
           Beer     Ales        94922755711     9.31    10.80    11.00
                                702770082018   12.46    10.74    13.74
                                736920111112   10.14    11.93     8.99
                    Lagers      702770081011   13.02    12.46    14.94

3) If we then want to filter to just specific rows, instead of using the

df[(df[col1] == val1) & (df[col2] == val2) & (df[col3] == val3)]

format, we can instead use .xs or .query (yes, these work for regular dfs too, but they're not very useful there). The syntax would instead be:

df.xs((val1, val2, val3), level=(col1, col2, col3)) 
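Applied to the example frame above, that could look like the following sketch (the level and column names are taken from the display; which rows come back of course depends on the data):

# All rows for Store 1, dropping the Store level from the result.
df.xs('Store 1', level='Store')

# Same selection, but keep the Store level in the result.
df.xs('Store 1', level='Store', drop_level=False)

# query() can reference index level names directly.
df.query("Category == 'Beer' and Subcategory == 'Ales'")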

More examples can be found in this tutorial notebook I put together.

Answered Oct 09 '22 by ZaxR