
What exactly is the lexsort_depth of a multi-index Dataframe?


What exactly is the lexsort_depth of a multi-index dataframe? Why does it have to be sorted for indexing?

For example, I have noticed that, after manually building a multi-index dataframe df with columns organized in three levels, if I try to do:

idx = pd.IndexSlice
df[idx['foo', 'bar']]

I get:

KeyError: 'Key length (2) was greater than MultiIndex lexsort depth (0)' 

and at this point, df.columns.lexsort_depth is 0

However, if I do, as recommended here and here:

df = df.sortlevel(0,axis=1) 

then the cross-section indexing works. Why? What exactly is lexsort_depth, and why does sorting with sortlevel fix this type of indexing?

Amelio Vazquez-Reina asked Nov 25 '14 00:11




1 Answer

lexsort_depth is the number of levels of a multi-index, counting from the outermost level, that are sorted lexically, that is, in a-b-c-1-2-3 order (the normal sort order).
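To make the definition concrete, here is a minimal pure-Python sketch that counts how many leading levels of a list of tuple keys are in lexical order. The helper name `lexsort_depth` and the sample keys are hypothetical and only mirror the idea; this is not pandas' implementation, which works on the index's internal integer codes.

```python
def lexsort_depth(keys):
    """Return how many leading levels of `keys` are in lexical order.

    `keys` is a list of equal-length tuples, one per row or column label.
    Hypothetical helper for illustration only; not the pandas implementation.
    """
    if not keys:
        return 0
    nlevels = len(keys[0])
    # Try the deepest prefix first and shrink until the truncated keys are sorted.
    for depth in range(nlevels, 0, -1):
        prefixes = [key[:depth] for key in keys]
        if all(a <= b for a, b in zip(prefixes, prefixes[1:])):
            return depth
    return 0


# Keys built by hand in arbitrary order: no level is sorted.
unsorted_keys = [("foo", "bar"), ("baz", "qux"), ("foo", "baz")]

# Only the first level is sorted; the second level is out of order within "a".
partially_sorted_keys = [("a", "y"), ("a", "x"), ("b", "z")]

# Fully sorted keys: every level counts.
sorted_keys = sorted(unsorted_keys)

print(lexsort_depth(unsorted_keys))          # 0
print(lexsort_depth(partially_sorted_keys))  # 1
print(lexsort_depth(sorted_keys))            # 2
```

Under this reading, a depth of 0 on a two-level key set matches the situation in the question, where a key of length 2 was longer than the sorted depth.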

So element indexing will still work if a multi-index is not sorted, but the lookups may be quite a bit slower (in 0.15.2, these kinds of lookups will show a PerformanceWarning, see here).

The reason that sorting is, in general, a good idea is that pandas is able to use hash-based indexing to figure out the location within each level independently; you can then use these indexers to find the final locations.

Pandas takes advantage of np.searchsorted to find these locations when the index is sorted. If it's not sorted, then it has to fall back to some different (slower) methods.

Here is the code that does this.
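Separately from the pandas source referenced above, the difference between the sorted and unsorted paths can be sketched with the standard library. This example swaps in `bisect` for np.searchsorted (both do a binary search over sorted data); the function names `locate_sorted` and `locate_unsorted` and the sample labels are assumptions made for illustration, not pandas internals.

```python
from bisect import bisect_left, bisect_right


def locate_sorted(labels, key):
    """Binary search: return (start, stop) so that labels[start:stop] == key.

    Only valid when `labels` is sorted; costs O(log n) comparisons.
    """
    start = bisect_left(labels, key)
    stop = bisect_right(labels, key)
    return start, stop


def locate_unsorted(labels, key):
    """Fallback for unsorted labels: scan every position, O(n)."""
    return [i for i, label in enumerate(labels) if label == key]


sorted_labels = ["bar", "bar", "baz", "foo", "foo", "foo"]
start, stop = locate_sorted(sorted_labels, "foo")
print(start, stop)  # 3 6: one contiguous slice

shuffled_labels = ["foo", "bar", "foo", "baz", "bar", "foo"]
positions = locate_unsorted(shuffled_labels, "foo")
print(positions)  # [0, 2, 5]: scattered positions found by a full scan
```

On sorted labels the matches form a single contiguous slice that two binary searches can delimit, which is why sorting the index first makes this kind of lookup cheap.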

Jeff answered Sep 19 '22 19:09
