I've got some data with a MultiIndex (some timing stats, with index levels for "device", "build configuration", "tested function", etc). I want to slice on some of those index levels.
It seems like "slicers" passed to .loc are probably the way to go. However, the docs contain this warning:
Warning: You will need to make sure that the selection axes are fully lexsorted!
Later on in the docs there's a section on The Need for Sortedness with MultiIndex which says
you are responsible for ensuring that things are properly sorted
but thankfully,
The MultiIndex object has code to explicitly check the sort depth. Thus, if you try to index at a depth at which the index is not sorted, it will raise an exception.
Sounds fine.
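To make that behaviour concrete, here is a minimal sketch, assuming a recent pandas release (where sort_index() is the way to sort a MultiIndex). The small frame and the try_slice helper are made up for illustration; the point is only that a label slice on an unsorted MultiIndex is refused, while the same slice works once the index is sorted.

```python
import pandas as pd

# Small made-up frame with a two-level MultiIndex in unsorted order.
df = pd.DataFrame(
    {"col1": ["b", "d", "b", "a"], "col2": [3, 1, 1, 2],
     "data": ["one", "two", "three", "four"]}
).set_index(["col1", "col2"])

idx = pd.IndexSlice


def try_slice(frame):
    """Hypothetical helper: return the sliced frame, or None if pandas refuses."""
    try:
        return frame.loc[idx["a":"b", :], :]
    except KeyError:  # UnsortedIndexError is a subclass of KeyError
        return None


unsorted_result = try_slice(df)              # index not sorted: slicing is refused
sorted_result = try_slice(df.sort_index())   # index sorted first: slicing works
```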
However, the remaining question is: how does one get the data properly sorted so that the indexing works? The docs talk about an important new method, sortlevel(), but then contain the following caveat:
There is an important new method sortlevel to sort an axis within a MultiIndex so that its labels are grouped and sorted by the original ordering of the associated factor at that level. Note that this does not necessarily mean the labels will be sorted lexicographically!
In my case, sortlevel() did the right thing, but what if my "original ordering of the associated factor" was not sorted? Is there a simple one-liner that I can use on any MultiIndex-ed DataFrame to ensure it's ready for slicing and fully lexsorted?
Edit: My exploration suggests that most ways of creating a MultiIndex automatically lexsort the unique labels when building the index. Example:
In [1]:
import pandas as pd
df = pd.DataFrame({'col1': ['b','d','b','a'], 'col2': [3,1,1,2],
'data':['one','two','three','four']})
df
Out[1]:
col1 col2 data
0 b 3 one
1 d 1 two
2 b 1 three
3 a 2 four
In [2]:
df2 = df.set_index(['col1','col2'])
df2
Out[2]:
data
col1 col2
b 3 one
d 1 two
b 1 three
a 2 four
In [3]: df2.index
Out[3]:
MultiIndex(levels=[[u'a', u'b', u'd'], [1, 2, 3]],
labels=[[1, 2, 1, 0], [2, 0, 0, 1]],
names=[u'col1', u'col2'])
Note how the unique items in the levels array are lexsorted, even though the DataFrame itself is not. Then, as expected:
In [4]: df2.index.is_lexsorted()
Out[4]: False
In [5]:
sorted = df2.sortlevel()
sorted
Out[5]:
data
col1 col2
a 2 four
b 1 three
3 one
d 1 two
In [6]: sorted.index.is_lexsorted()
Out[6]: True
However, if the levels are explicitly ordered so they are not sorted, things get weird:
In [7]:
df3 = df2
df3.index.set_levels(['b','d','a'], level='col1', inplace=True)
df3.index.set_labels([0,1,0,2], level='col1', inplace=True)
df3
Out[7]:
data
col1 col2
b 3 one
d 1 two
b 1 three
a 2 four
In [8]:
sorted2 = df3.sortlevel()
sorted2
Out[8]:
data
col1 col2
b 1 three
3 one
d 1 two
a 2 four
In [9]: sorted2.index.is_lexsorted()
Out[9]: True
In [10]: sorted2.index
Out[10]:
MultiIndex(levels=[[u'b', u'd', u'a'], [1, 2, 3]],
labels=[[0, 0, 1, 2], [0, 2, 0, 1]],
names=[u'col1', u'col2'])
So sorted2 is reporting that it is lexsorted, when in fact it is not. This feels a little like what the warning in the docs is getting at, but it's still not clear how to fix it or whether it's really an issue at all.
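One way to see what is going on is to compare the integer positions stored in the index with the label values they stand for. The sketch below rebuilds an index like sorted2 by hand; it assumes a recent pandas release, where the integer positions are passed as codes (the output above shows them under the older name labels). The variable names are illustrative only.

```python
import pandas as pd

# Rebuild an index shaped like sorted2, with 'col1' levels in the
# deliberately non-alphabetical order ['b', 'd', 'a'].
mi = pd.MultiIndex(
    levels=[["b", "d", "a"], [1, 2, 3]],
    codes=[[0, 0, 1, 2], [0, 2, 0, 1]],
    names=["col1", "col2"],
)

# The integer codes, read row by row, are in ascending order...
code_rows = list(zip(*mi.codes))
codes_sorted = code_rows == sorted(code_rows)

# ...but the label tuples those codes stand for are not.
label_rows = list(mi)
labels_sorted = label_rows == sorted(label_rows)
```

Here codes_sorted is True while labels_sorted is False, which would be consistent with is_lexsorted() looking at the integer positions rather than at the label values.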
As far as sorting goes, as @EdChum pointed out, the docs here seem to indicate that it is lexicographically sorted.
For checking whether your index (or columns) are sorted, they have a method is_lexsorted() and an attribute lexsort_depth (which for some reason you can't really find in the documentation itself).
Example:
Create a Series with a randomly ordered index:
In [1]:
import random
import numpy as np  # needed for np.random.randn below
import pandas as pd
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', '1', '3', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
random.shuffle(tuples)  # shuffle in place so the index starts out unsorted
s = pd.Series(np.random.randn(8), index=pd.MultiIndex.from_tuples(tuples))
s
Out[1]:
baz 3 -0.191653
qux two -1.410311
bar one -0.336475
qux one -1.192908
foo two 0.486401
baz 1 0.888314
foo one -1.504816
bar two 0.917460
dtype: float64
Check is_lexsorted and lexsort_depth:
In [2]: s.index.is_lexsorted()
Out[2]: False
In [3]: s.index.lexsort_depth
Out[3]: 0
Sort the index, and recheck the values:
In [4]: s = s.sortlevel(0, sort_remaining=True)
s
Out[4]:
bar one -0.336475
two 0.917460
baz 1 0.888314
3 -0.191653
foo one -1.504816
two 0.486401
qux one -1.192908
two -1.410311
dtype: float64
In [5]: s.index.is_lexsorted()
Out[5]: True
In [6]: s.index.lexsort_depth
Out[6]: 2
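For readers on a recent pandas release, where to my knowledge sortlevel() is no longer available and is_lexsorted() has been phased out, a rough equivalent of the steps above is sort_index() followed by a check of is_monotonic_increasing. The sketch below is written under that assumption and reuses the awkward case from the question, with the 'col1' levels deliberately stored out of alphabetical order.

```python
import pandas as pd

# Index with 'col1' levels stored as ['b', 'd', 'a'], as in the question.
index = pd.MultiIndex(
    levels=[["b", "d", "a"], [1, 2, 3]],
    codes=[[0, 1, 0, 2], [2, 0, 0, 1]],
    names=["col1", "col2"],
)
df = pd.DataFrame({"data": ["one", "two", "three", "four"]}, index=index)

# One-liner: sort on all index levels by label value.
ready = df.sort_index()

# Sanity check that the index is now ordered and safe to slice.
is_sorted = ready.index.is_monotonic_increasing
```

After this, ready is ordered a, b, b, d on col1 regardless of how the levels were stored, and is_sorted is True.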