So I learned that I can use DataFrame.groupby without having a MultiIndex to do subsampling/cross-sections.
On the other hand, when I have a MultiIndex on a DataFrame, I still need to use DataFrame.groupby to do sub-sampling/cross-sections.
So what is a MultiIndex good for apart from the quite helpful and pretty display of the hierarchies when printing?
The MultiIndex object is the hierarchical analogue of the standard Index object which typically stores the axis labels in pandas objects. You can think of MultiIndex as an array of tuples where each tuple is unique. A MultiIndex can be created from a list of arrays (using MultiIndex.
An index on a Pandas DataFrame gives us a way to identify rows. Identifying rows by a “label” is arguably better than identifying a row by number. If you only have the integer position to work with, you have to remember the number for each row.
As we can see in the output, midx MultiIndex has 3 levels.
Hierarchical indexing (also referred to as “multi-level” indexing) was introduced in the pandas 0.4 release.
This opens the door to some quite sophisticated data analysis and manipulation, especially for working with higher dimensional data. In essence, it enables you to effectively store and manipulate arbitrarily high dimension data in a 2-dimensional tabular structure (DataFrame), for example.
Imagine constructing a dataframe using MultiIndex
like this:-
import pandas as pd import numpy as np np.arrays = [['one','one','one','two','two','two'],[1,2,3,1,2,3]] df = pd.DataFrame(np.random.randn(6,2),index=pd.MultiIndex.from_tuples(list(zip(*np.arrays))),columns=['A','B']) df # This is the dataframe we have generated A B one 1 -0.732470 -0.313871 2 -0.031109 -2.068794 3 1.520652 0.471764 two 1 -0.101713 -1.204458 2 0.958008 -0.455419 3 -0.191702 -0.915983
This df
is simply a data structure of two dimensions
df.ndim 2
But we can imagine it, looking at the output, as a 3 dimensional data structure.
one
with 1
with data -0.732470 -0.313871
. one
with 2
with data -0.031109 -2.068794
. one
with 3
with data 1.520652 0.471764
.A.k.a.: "effectively store and manipulate arbitrarily high dimension data in a 2-dimensional tabular structure"
This is not just a "pretty display". It has the benefit of easy retrieval of data since we now have a hierarchal index.
For example.
In [44]: df.ix["one"] Out[44]: A B 1 -0.732470 -0.313871 2 -0.031109 -2.068794 3 1.520652 0.471764
will give us a new data frame only for the group of data belonging to "one".
And we can narrow down our data selection further by doing this:-
In [45]: df.ix["one"].ix[1] Out[45]: A -0.732470 B -0.313871 Name: 1
And of course, if we want a specific value, here's an example:-
In [46]: df.ix["one"].ix[1]["A"] Out[46]: -0.73247029752040727
So if we have even more indexes (besides the 2 indexes shown in the example above), we can essentially drill down and select the data set we are really interested in without a need for groupby
.
We can even grab a cross-section (either rows or columns) from our dataframe...
By rows:-
In [47]: df.xs('one') Out[47]: A B 1 -0.732470 -0.313871 2 -0.031109 -2.068794 3 1.520652 0.471764
By columns:-
In [48]: df.xs('B', axis=1) Out[48]: one 1 -0.313871 2 -2.068794 3 0.471764 two 1 -1.204458 2 -0.455419 3 -0.915983 Name: B
Great post by @Calvin Cheng, but thought I'd take a stab at this as well.
When to use a MultiIndex:
Why (your core question) - at least these are the biggest benefits IMO:
Example:
Dollars Units Date Store Category Subcategory UPC EAN 2018-07-10 Store 1 Alcohol Liqour 80480280024 154.77 7 Store 2 Alcohol Liqour 80480280024 82.08 4 Store 3 Alcohol Liqour 80480280024 259.38 9 Store 1 Alcohol Liquor 80432400630 477.68 14 674545000001 139.68 4 Store 2 Alcohol Liquor 80432400630 203.88 6 674545000001 377.13 13 Store 3 Alcohol Liquor 80432400630 239.19 7 674545000001 432.32 14 Store 1 Beer Ales 94922755711 65.17 7 702770082018 174.44 14 736920111112 50.70 5 Store 2 Beer Ales 94922755711 129.60 12 702770082018 107.40 10 736920111112 59.65 5 Store 3 Beer Ales 94922755711 154.00 14 702770082018 137.40 10 736920111112 107.88 12 Store 1 Beer Lagers 702770081011 156.24 12 Store 2 Beer Lagers 702770081011 137.06 11 Store 3 Beer Lagers 702770081011 119.52 8
1) If we want to easily compare sales across stores, we can use df.unstack('Store')
to line everything up side-by-side:
Dollars Units Store Store 1 Store 2 Store 3 Store 1 Store 2 Store 3 Date Category Subcategory UPC EAN 2018-07-10 Alcohol Liqour 80480280024 154.77 82.08 259.38 7 4 9 Liquor 80432400630 477.68 203.88 239.19 14 6 7 674545000001 139.68 377.13 432.32 4 13 14 Beer Ales 94922755711 65.17 129.60 154.00 7 12 14 702770082018 174.44 107.40 137.40 14 10 10 736920111112 50.70 59.65 107.88 5 5 12 Lagers 702770081011 156.24 137.06 119.52 12 11 8
2) We can also easily do math on multiple columns. For example, df['Dollars'] / df['Units']
will then divide each store's dollars by its units, for every store without multiple operations:
Store Store 1 Store 2 Store 3 Date Category Subcategory UPC EAN 2018-07-10 Alcohol Liqour 80480280024 22.11 20.52 28.82 Liquor 80432400630 34.12 33.98 34.17 674545000001 34.92 29.01 30.88 Beer Ales 94922755711 9.31 10.80 11.00 702770082018 12.46 10.74 13.74 736920111112 10.14 11.93 8.99 Lagers 702770081011 13.02 12.46 14.94
3) If we then want to filter to just specific rows, instead of using the
df[(df[col1] == val1) and (df[col2] == val2) and (df[col3] == val3)]
format, we can instead .xs or .query (yes these work for regular dfs, but it's not very useful). The syntax would instead be:
df.xs((val1, val2, val3), level=(col1, col2, col3))
More examples can be found in this tutorial notebook I put together.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With