When creating a Pandas dataframe with a MultiIndex, the levels seem to always be sorted:
>>> pd.DataFrame([range(4)], columns=pd.MultiIndex.from_product([["b", "a"], [20, 10]]))
   b      a
  20 10  20 10
0  0  1   2  3
>>> _.columns
MultiIndex(levels=[[u'a', u'b'], [10, 20]],
labels=[[1, 1, 0, 0], [1, 0, 1, 0]])
(Note how levels is sorted.) Is this guaranteed? Knowing this can help write robust code (since we can then rely on a simple property of MultiIndices).
I can't find any guarantee in the documentation (but then this doesn't mean that it couldn't be there!).
There are also old examples (from 2015) that show a different behavior, but maybe Pandas now guarantees the ordering of levels (in the same way that Python, since version 3.7, guarantees the order of keys in dictionaries)?
When creating a MultiIndex using from_product() or from_arrays(), the levels will be sorted because both methods use _factorize_from_iterables(), which returns sorted indexes:
>> list(_factorize_from_iterables([["b", "a"], [20, 10]]))
[[array([1, 0], dtype=int8), array([1, 0], dtype=int8)],
[Index(['a', 'b'], dtype='object'), Int64Index([10, 20], dtype='int64')]]
MultiIndex.from_tuples() will also have sorted levels because it uses from_arrays() internally.
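As a quick check of the three constructors mentioned above, here is a minimal sketch. The helper levels_are_sorted() is a hypothetical name introduced only for this illustration (it is not part of pandas), and the result reflects observed behavior rather than a documented guarantee:

```python
import pandas as pd


def levels_are_sorted(index):
    # Hypothetical helper: True if every level of the MultiIndex is ascending.
    return all(level.is_monotonic_increasing for level in index.levels)


# The same four column keys, deliberately given in descending order.
from_product = pd.MultiIndex.from_product([["b", "a"], [20, 10]])
from_arrays = pd.MultiIndex.from_arrays([["b", "b", "a", "a"], [20, 10, 20, 10]])
from_tuples = pd.MultiIndex.from_tuples([("b", 20), ("b", 10), ("a", 20), ("a", 10)])

for name, index in [("from_product", from_product),
                    ("from_arrays", from_arrays),
                    ("from_tuples", from_tuples)]:
    print(name, levels_are_sorted(index))
```

The order of the entries themselves is unchanged (the first column is still ("b", 20)); only the levels attribute is sorted.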
If you build a MultiIndex by calling its constructor directly, however, the levels won't be sorted:
>> midx = pd.MultiIndex(levels=[['b', 'a'], [20, 10]],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
>> df = pd.DataFrame(np.random.randn(4,4), columns=midx)
>> df.columns
MultiIndex(levels=[['b', 'a'], [20, 10]],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
The above uses pandas version 0.22.0 (released on December 29, 2017) and was tested on version 0.23.4 (the latest release at the time of writing).
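If code needs to rely on sorted levels regardless of how the index was built, one option is to rebuild it through from_tuples(). This is only a sketch, and it assumes pandas 0.24 or later, where the constructor argument labels used in the examples above was renamed to codes:

```python
import pandas as pd

# Direct construction keeps the levels exactly as given (unsorted here).
# Assumption: pandas >= 0.24, where `labels` is spelled `codes`.
midx = pd.MultiIndex(levels=[["b", "a"], [20, 10]],
                     codes=[[0, 0, 1, 1], [0, 1, 0, 1]])
print(midx.levels[0].tolist())     # ['b', 'a']

# Rebuilding from the tuples goes through the factorizing code path,
# so the levels come out sorted while the entries keep their order.
rebuilt = pd.MultiIndex.from_tuples(midx.tolist())
print(rebuilt.levels[0].tolist())  # ['a', 'b']
print(rebuilt.tolist() == midx.tolist())
```

The rebuilt index holds the same entries in the same positions, so it can replace the original as a DataFrame's columns without reordering any data.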