
Ensuring lexicographical sort in pandas MultiIndex

Tags: pandas

I've got some data with a MultiIndex (some timing stats, with index levels for "device", "build configuration", "tested function", etc). I want to slice on some of those index levels.

It seems like passing "slicers" to .loc is probably the way to go. However, the docs contain this warning:

Warning: You will need to make sure that the selection axes are fully lexsorted!
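For context, here is a minimal sketch of the kind of slicer selection in question. The level names follow the description above, but the labels ('phone', 'debug', 'f', ...) and the 'ms' column are made up for illustration, and it assumes a pandas version that provides pd.IndexSlice and sort_index().

```python
import pandas as pd

# Hypothetical timing data: three index levels, one value column.
idx = pd.MultiIndex.from_product(
    [['phone', 'tablet'], ['debug', 'release'], ['f', 'g']],
    names=['device', 'build', 'function'],
)
timings = pd.DataFrame({'ms': list(range(8))}, index=idx)

# Sort the index first so that slicers are safe to use.
timings = timings.sort_index()

# Select every device, but only the 'release' build of function 'f'.
sl = pd.IndexSlice
release_f = timings.loc[sl[:, 'release', 'f'], :]
```

The slice(None) produced by the bare ":" in the slicer means "everything at this level", which is why the index needs to be sorted at every level being sliced.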

Later on in the docs there's a section on The Need for Sortedness with MultiIndex which says

you are responsible for ensuring that things are properly sorted

but thankfully,

The MultiIndex object has code to explicitly check the sort depth. Thus, if you try to index at a depth at which the index is not sorted, it will raise an exception.

Sounds fine.

However, the remaining question is: how does one get the data sorted so that the indexing works? The docs talk about an important new method sortlevel(), but then contain the following caveat:

There is an important new method sortlevel to sort an axis within a MultiIndex so that its labels are grouped and sorted by the original ordering of the associated factor at that level. Note that this does not necessarily mean the labels will be sorted lexicographically!

In my case, sortlevel() did the right thing, but what if my "original ordering of the associated factor" was not sorted? Is there a simple one-liner that I can use on any MultiIndex-ed DataFrame to ensure it's ready for slicing and fully lexsorted?


Edit: My exploration suggests that most ways of creating a MultiIndex automatically lexsort the unique labels when building the index. Example:

In [1]: 
import pandas as pd
df = pd.DataFrame({'col1': ['b','d','b','a'], 'col2': [3,1,1,2],
                  'data':['one','two','three','four']})
df

Out[1]: 
  col1  col2   data
0    b     3    one
1    d     1    two
2    b     1  three
3    a     2   four

In [2]:
df2 = df.set_index(['col1','col2'])
df2

Out[2]: 
            data
col1 col2       
b    3       one
d    1       two
b    1     three
a    2      four

In [3]: df2.index
Out[3]: 
MultiIndex(levels=[[u'a', u'b', u'd'], [1, 2, 3]],
           labels=[[1, 2, 1, 0], [2, 0, 0, 1]],
           names=[u'col1', u'col2'])

Note how the unique items in the levels array are lexsorted, even though the DataFrame itself is not. Then, as expected:

In [4]: df2.index.is_lexsorted()
Out[4]: False

In [5]: 
sorted = df2.sortlevel()
sorted
Out[5]: 
            data
col1 col2       
a    2      four
b    1     three
     3       one
d    1       two

In [6]: sorted.index.is_lexsorted()
Out[6]: True

However, if the levels are explicitly ordered so they are not sorted, things get weird:

In [7]:
df3 = df2
df3.index.set_levels(['b','d','a'], level='col1', inplace=True)
df3.index.set_labels([0,1,0,2], level='col1', inplace=True)
df3

Out[7]: 
            data
col1 col2       
b    3       one
d    1       two
b    1     three
a    2      four

In [8]:
sorted2 = df3.sortlevel()
sorted2

Out[8]: 
            data
col1 col2       
b    1     three
     3       one
d    1       two
a    2      four

In [9]: sorted2.index.is_lexsorted()
Out[9]: True

In [10]: sorted2.index
Out[10]: 
MultiIndex(levels=[[u'b', u'd', u'a'], [1, 2, 3]],
           labels=[[0, 0, 1, 2], [0, 2, 0, 1]],
           names=[u'col1', u'col2'])

So sorted2 is reporting that it is lexsorted, when in fact it is not. This feels a little like what the warning in the docs is getting at, but it's still not clear how to fix it or whether it's really an issue at all.
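The unsorted-levels case can also be reproduced without mutating an index in place. This is only a sketch: it assumes a pandas version whose MultiIndex constructor accepts codes= (the newer name for labels=) and which offers sort_index() and is_monotonic_increasing. The behaviour of sortlevel() in the session above may well differ from what sort_index() does here.

```python
import pandas as pd

# Build an index whose first level is deliberately stored out of order.
idx = pd.MultiIndex(
    levels=[['b', 'd', 'a'], [1, 2, 3]],
    codes=[[0, 1, 0, 2], [2, 0, 0, 1]],
    names=['col1', 'col2'],
)
df3 = pd.DataFrame({'data': ['one', 'two', 'three', 'four']}, index=idx)

# sort_index() orders rows by the actual label values,
# not by their position inside the levels array.
result = df3.sort_index()

# is_monotonic_increasing also looks at the values, so it can serve as
# a sortedness check that does not depend on how the levels are stored.
value_sorted = result.index.is_monotonic_increasing
first_level = list(result.index.get_level_values('col1'))
```

Comparing first_level with the row order in Out[8] shows whether a given pandas version sorts by level position or by label value.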

tangobravo asked Nov 10 '22 09:11


1 Answer

As far as sorting goes, as @EdChum pointed out, the docs here seem to indicate that the result is lexicographically sorted.

To check whether your index (or columns) are sorted, a MultiIndex has a method is_lexsorted() and an attribute lexsort_depth (which for some reason you can't really find in the documentation itself).

Example:

Create a Series whose MultiIndex is in random order:

In [1]:
import random

import numpy as np
import pandas as pd

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', '1', '3', 'one', 'two', 'one', 'two']]

tuples = list(zip(*arrays))
random.shuffle(tuples)  # shuffle so the index starts out unsorted
s = pd.Series(np.random.randn(8), index=pd.MultiIndex.from_tuples(tuples))
s

Out[1]:
baz  3     -0.191653
qux  two   -1.410311
bar  one   -0.336475
qux  one   -1.192908
foo  two    0.486401
baz  1      0.888314
foo  one   -1.504816
bar  two    0.917460
dtype: float64

Check is_lexsorted and lexsort_depth:

In [2]: s.index.is_lexsorted()
Out[2]: False

In [3]: s.index.lexsort_depth
Out[3]: 0

Sort the index, and recheck the values:

In [4]: s = s.sortlevel(0, sort_remaining=True)
        s

Out[4]:
bar  one   -0.336475
     two    0.917460
baz  1      0.888314
     3     -0.191653
foo  one   -1.504816
     two    0.486401
qux  one   -1.192908
     two   -1.410311
dtype: float64

In [5]: s.index.is_lexsorted()
Out[5]: True

In [6]: s.index.lexsort_depth  
Out[6]: 2
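
As a one-liner for "make this ready for slicing", something along these lines may work on pandas releases where sort_index() is used in place of sortlevel(). Treat it as a sketch: ensure_sorted is a hypothetical helper name, not a pandas function, and the check relies on is_monotonic_increasing rather than is_lexsorted().

```python
import pandas as pd

def ensure_sorted(obj):
    """Return obj with its index sorted on all levels (hypothetical helper)."""
    if obj.index.is_monotonic_increasing:
        return obj  # already sorted, nothing to do
    return obj.sort_index()

# Same data as in the question.
df = pd.DataFrame({'col1': ['b', 'd', 'b', 'a'], 'col2': [3, 1, 1, 2],
                   'data': ['one', 'two', 'three', 'four']})
df2 = df.set_index(['col1', 'col2'])

ready = ensure_sorted(df2)

# With a sorted index, a slicer on the second level can be used.
subset = ready.loc[('b', slice(None)), :]
```

The helper returns the original object untouched when it is already sorted, so it should be cheap to call defensively before any slicing.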
Julien Marrec answered Dec 17 '22 18:12