When creating a Pandas dataframe with a MultiIndex, the levels seem to always be sorted:
>>> pd.DataFrame([range(4)], columns=pd.MultiIndex.from_product([["b", "a"], [20, 10]]))
    b       a
   20  10  20  10
0   0   1   2   3
>>> _.columns
MultiIndex(levels=[[u'a', u'b'], [10, 20]],
           labels=[[1, 1, 0, 0], [1, 0, 1, 0]])
(Note how levels is sorted.) Is this guaranteed? Knowing this can help write robust code (since we can then rely on a simple property of MultiIndices).
I can't find any guarantee in the documentation (but then this doesn't mean that it couldn't be there!).
There are also old examples (from 2015) that show different behavior, but perhaps Pandas now guarantees the ordering of levels (in the same way that Python 3.7 guarantees the order of keys in dictionaries)?
When you create a MultiIndex with from_product() or from_arrays(), the levels are sorted, because both methods call the internal helper _factorize_from_iterables(), which returns the unique values of each level in sorted order.
>>> list(_factorize_from_iterables([["b", "a"], [20, 10]]))
[[array([1, 0], dtype=int8), array([1, 0], dtype=int8)],
 [Index(['a', 'b'], dtype='object'), Int64Index([10, 20], dtype='int64')]]
MultiIndex.from_tuples() will also have sorted levels because it uses from_arrays() internally.
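The claim about the three factory methods can be checked through the public API alone, without touching the private helper. The following is a minimal sketch, assuming a reasonably recent pandas; the variable names are illustrative only, and it demonstrates observed behavior rather than a documented guarantee.

```python
import pandas as pd

outer, inner = ["b", "a"], [20, 10]
# The same four column keys, deliberately in unsorted order.
pairs = [(o, i) for o in outer for i in inner]

by_product = pd.MultiIndex.from_product([outer, inner])
by_tuples = pd.MultiIndex.from_tuples(pairs)
by_arrays = pd.MultiIndex.from_arrays(
    [[o for o, _ in pairs], [i for _, i in pairs]]
)

for name, midx in [
    ("from_product", by_product),
    ("from_tuples", by_tuples),
    ("from_arrays", by_arrays),
]:
    # .levels holds the unique values per level; the entries themselves
    # keep the order in which they were supplied.
    levels = [level.tolist() for level in midx.levels]
    print(name, levels, midx.tolist() == pairs)
```

In each case the levels come out as [['a', 'b'], [10, 20]], while the entries of the index stay in their original, unsorted order.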
If you construct a MultiIndex directly through its constructor, however, the levels are not sorted; they are kept in the order you pass them.
>>> midx = pd.MultiIndex(levels=[['b', 'a'], [20, 10]],
...                      labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
>>> df = pd.DataFrame(np.random.randn(4,4), columns=midx)
>>> df.columns
MultiIndex(levels=[['b', 'a'], [20, 10]],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
The above uses pandas version 0.22.0 (released on December 29, 2017) and was also tested on version 0.23.4 (the latest release at the time of writing).
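Since sorted levels depend on how the index was built, code that needs them can rebuild the index instead of assuming them. Below is a minimal sketch: with_sorted_levels is a hypothetical helper, not part of pandas, and the constructor call uses the codes= keyword, which replaced labels= in pandas versions newer than the ones mentioned above.

```python
import pandas as pd

def with_sorted_levels(midx):
    """Hypothetical helper: rebuild a MultiIndex through from_arrays() so
    that its levels are sorted, keeping the entries in the same order."""
    arrays = [midx.get_level_values(i) for i in range(midx.nlevels)]
    return pd.MultiIndex.from_arrays(arrays, names=midx.names)

# Direct construction keeps the levels exactly as given.
# Assumption: a pandas version that accepts codes= (older ones use labels=).
midx = pd.MultiIndex(levels=[["b", "a"], [20, 10]],
                     codes=[[0, 0, 1, 1], [0, 1, 0, 1]])
print([level.tolist() for level in midx.levels])     # [['b', 'a'], [20, 10]]

rebuilt = with_sorted_levels(midx)
print([level.tolist() for level in rebuilt.levels])  # [['a', 'b'], [10, 20]]
print(rebuilt.tolist() == midx.tolist())             # True
```

Only the internal levels change; the index entries, and therefore the column order of a DataFrame using it, stay the same.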