What causes "indexing past lexsort depth" warning in Pandas?

I'm indexing a large multi-index Pandas df using df.loc[(key1, key2)]. Sometimes I get a Series back (as expected), but other times I get a DataFrame. I'm trying to isolate the cases that cause the latter, but so far all I can see is that it's correlated with getting this warning: PerformanceWarning: indexing past lexsort depth may impact performance.

I'd like to reproduce it to post here, but I can't generate another case that gives me the same warning. Here's my attempt:

    import numpy as np
    import pandas as pd

    def random_dates(start, end, n=10):
        start_u = start.value // 10**9
        end_u = end.value // 10**9
        return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')

    np.random.seed(0)
    df = pd.DataFrame(np.random.random(3255000).reshape(465000, 7))  # same shape as my data
    df['date'] = random_dates(pd.to_datetime('1990-01-01'), pd.to_datetime('2018-01-01'), 465000)
    df = df.set_index([0, 'date'])
    df = df.sort_values(by=[3])  # unsort indices, just in case

    df.index.lexsort_depth  # 0
    df.index.is_monotonic   # False
    df.loc[(0.9987185534991936, pd.to_datetime('2012-04-16 07:04:34'))]  # no warning

So my question is: what causes this warning? How do I artificially induce it?

295

asked Jan 22 '19 11:01

Josh Friedlander

2 Answers

TL;DR: your index is unsorted and this severely impacts performance.

Sort your DataFrame's index using df.sort_index() to address the warning and improve performance.

I've actually written about this in detail in my writeup: Select rows in pandas MultiIndex DataFrame (under "Question 3").

To reproduce,

    import numpy as np
    import pandas as pd

    mux = pd.MultiIndex.from_arrays([
        list('aaaabbbbbccddddd'),
        list('tuvwtuvwtuvwtuvw')
    ], names=['one', 'two'])

    df = pd.DataFrame({'col': np.arange(len(mux))}, mux)

             col
    one two
    a   t      0
        u      1
        v      2
        w      3
    b   t      4
        u      5
        v      6
        w      7
        t      8
    c   u      9
        v     10
    d   w     11
        t     12
        u     13
        v     14
        w     15

You'll notice that the second level is not properly sorted.

Now, try to index a specific cross section:

    df.loc[pd.IndexSlice[('c', 'u')]]

    PerformanceWarning: indexing past lexsort depth may impact performance.

             col
    one two
    c   u      9

You'll see the same behaviour with xs:

    df.xs(('c', 'u'), axis=0)

    PerformanceWarning: indexing past lexsort depth may impact performance.

             col
    one two
    c   u      9
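For isolating which lookups trigger the warning, one option is to record warnings around each lookup. The following is only a sketch and not part of the original answer: the helper name lookup_warnings is made up for illustration, and it assumes the warning class is available as pd.errors.PerformanceWarning and that the installed pandas version still emits it for this lookup.

```python
import warnings

import numpy as np
import pandas as pd

# Same frame as in the reproduction above.
mux = pd.MultiIndex.from_arrays(
    [list("aaaabbbbbccddddd"), list("tuvwtuvwtuvwtuvw")],
    names=["one", "two"],
)
df = pd.DataFrame({"col": np.arange(len(mux))}, index=mux)


def lookup_warnings(frame, key):
    """Hypothetical helper: run frame.loc[key] and return
    (result, list of recorded PerformanceWarning entries)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure nothing is filtered out
        result = frame.loc[key]
    perf = [
        w for w in caught
        if issubclass(w.category, pd.errors.PerformanceWarning)
    ]
    return result, perf


unsorted_result, unsorted_perf = lookup_warnings(df, ("c", "u"))
sorted_result, sorted_perf = lookup_warnings(df.sort_index(), ("c", "u"))

# Number of PerformanceWarnings recorded for each lookup.
print(len(unsorted_perf), len(sorted_perf))
print([str(w.message) for w in unsorted_perf])
```

Alternatively, calling warnings.simplefilter("error", pd.errors.PerformanceWarning) once at the top of a script turns the warning into an exception, so the traceback points at the offending lookup.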

The docs, backed by this timing test I once did, seem to suggest that handling un-sorted indexes imposes a slowdown: indexing is O(N) time when it could/should be O(1).

If you sort the index before slicing, you'll notice the difference:

    df2 = df.sort_index()
    df2.loc[pd.IndexSlice[('c', 'u')]]

             col
    one two
    c   u      9

    %timeit df.loc[pd.IndexSlice[('c', 'u')]]
    %timeit df2.loc[pd.IndexSlice[('c', 'u')]]

    802 µs ± 12.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    648 µs ± 20.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Finally, if you want to know whether the index is sorted or not, check with MultiIndex.is_lexsorted.

    df.index.is_lexsorted()   # False
    df2.index.is_lexsorted()  # True
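Note that MultiIndex.is_lexsorted was deprecated in pandas 1.3 in favour of MultiIndex.is_monotonic_increasing, so it may not be available in newer releases. A minimal sketch of the same check with the replacement attribute follows; the helper ensure_sorted is a hypothetical name used only for illustration.

```python
import numpy as np
import pandas as pd

mux = pd.MultiIndex.from_arrays(
    [list("aaaabbbbbccddddd"), list("tuvwtuvwtuvwtuvw")],
    names=["one", "two"],
)
df = pd.DataFrame({"col": np.arange(len(mux))}, index=mux)
df2 = df.sort_index()

# True only when the index tuples are in non-decreasing lexicographic order.
print(df.index.is_monotonic_increasing)   # False
print(df2.index.is_monotonic_increasing)  # True


def ensure_sorted(frame):
    """Hypothetical helper: sort the index only when it is not already sorted."""
    if frame.index.is_monotonic_increasing:
        return frame
    return frame.sort_index()


df3 = ensure_sorted(df)
```

Checking first avoids re-sorting a large frame that is already in order.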

As for your question on how to induce this behaviour, simply permuting the indices should suffice. This works if your index is unique:

    df2 = df.loc[pd.MultiIndex.from_tuples(np.random.permutation(df2.index))]

If your index is not unique, add a cumcounted level first,

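    # set_index returns a new frame, so keep the result
    df = df.set_index(
        df.groupby(level=list(range(len(df.index.levels)))).cumcount(), append=True)

    # permute the index that now carries the extra cumcount level
    df2 = df.loc[pd.MultiIndex.from_tuples(np.random.permutation(df.index))]
    df2 = df2.reset_index(level=-1, drop=True)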
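Uniqueness also seems to matter for the Series-versus-DataFrame behaviour described in the question. The sketch below is an observation aid rather than a statement about pandas internals: it assumes that a full-key lookup on a unique MultiIndex returns a single row as a Series, while the same lookup on an index with duplicate entries returns a DataFrame. The frames are made up for illustration, and DataFrame.sample is used here as a simpler shuffle than the permutation approach above.

```python
import numpy as np
import pandas as pd

# Unique index, shuffled: every (one, two) pair occurs exactly once.
unique_mux = pd.MultiIndex.from_product(
    [list("abc"), list("tuv")], names=["one", "two"]
)
unique_df = pd.DataFrame({"col": np.arange(len(unique_mux))}, index=unique_mux)
unique_df = unique_df.sample(frac=1, random_state=0)  # shuffle the rows

# Non-unique index: the pair ('b', 't') occurs twice, second level unsorted.
dup_mux = pd.MultiIndex.from_arrays(
    [list("aabbb"), list("tutut")], names=["one", "two"]
)
dup_df = pd.DataFrame({"col": np.arange(len(dup_mux))}, index=dup_mux)

row = unique_df.loc[("b", "t")]  # full key on a unique index
rows = dup_df.loc[("b", "t")]    # full key on an index with duplicates

print(unique_df.index.is_unique, dup_df.index.is_unique)
print(type(row).__name__, type(rows).__name__)
```

If the real data behaves the same way, checking df.index.is_unique (or df.index.duplicated().any()) is a quick way to find out whether duplicate keys explain the DataFrame results. The question's attempt builds its index from random floats and dates, which appears to be unique, and that would fit the "no warning" outcome reported there.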

178

answered Oct 12 '22 16:10

cs95

According to the pandas advanced indexing docs (Sorting a MultiIndex):

On higher dimensional objects, you can sort any of the other axes by level if they have a MultiIndex

And also:

Indexing will work even if the data are not sorted, but will be rather inefficient (and show a PerformanceWarning). It will also return a copy of the data rather than a view:

In other words, make sure the index is sorted (for example with df.sort_index()) before indexing into it.
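The first quote also applies to the column axis. As a small sketch, with a made-up frame whose names (wide, outer, inner) are purely illustrative, sorting MultiIndex columns by level could look like this:

```python
import numpy as np
import pandas as pd

# Columns carry a MultiIndex whose order is deliberately scrambled.
cols = pd.MultiIndex.from_tuples(
    [("b", "y"), ("a", "y"), ("b", "x"), ("a", "x")],
    names=["outer", "inner"],
)
wide = pd.DataFrame(np.arange(8).reshape(2, 4), columns=cols)

# Sort the column axis by the "outer" level; the remaining levels are
# sorted as well because sort_remaining defaults to True.
wide_sorted = wide.sort_index(axis=1, level="outer")

print(wide.columns.is_monotonic_increasing)
print(wide_sorted.columns.tolist())
```

The data moves together with its column labels, so only the column order changes.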

answered Oct 12 '22 17:10

ndrwnaguib


Tags: python, pandas