
What causes "indexing past lexsort depth" warning in Pandas?

Tags: python, pandas

I'm indexing a large multi-index Pandas df using df.loc[(key1, key2)]. Sometimes I get a series back (as expected), but other times I get a dataframe. I'm trying to isolate the cases which cause the latter, but so far all I can see is that it's correlated with getting a PerformanceWarning: indexing past lexsort depth may impact performance warning.

I'd like to reproduce it to post here, but I can't generate another case that gives me the same warning. Here's my attempt:

import numpy as np
import pandas as pd

def random_dates(start, end, n=10):
    # draw n random timestamps (second resolution) between start and end
    start_u = start.value//10**9
    end_u = end.value//10**9
    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')

np.random.seed(0)
df = pd.DataFrame(np.random.random(3255000).reshape(465000,7))  # same shape as my data
df['date'] = random_dates(pd.to_datetime('1990-01-01'), pd.to_datetime('2018-01-01'), 465000)
df = df.set_index([0, 'date'])
df = df.sort_values(by=[3])  # unsort indices, just in case
df.index.lexsort_depth  # 0
df.index.is_monotonic  # False
df.loc[(0.9987185534991936, pd.to_datetime('2012-04-16 07:04:34'))]  # no warning

So my question is: what causes this warning? How do I artificially induce it?

Josh Friedlander asked Jan 22 '19 11:01




2 Answers

TL;DR: your index is unsorted and this severely impacts performance.

Sort your DataFrame's index using df.sort_index() to address the warning and improve performance.


I've actually written about this in detail in my writeup: Select rows in pandas MultiIndex DataFrame (under "Question 3").

To reproduce,

mux = pd.MultiIndex.from_arrays([
    list('aaaabbbbbccddddd'),
    list('tuvwtuvwtuvwtuvw')
], names=['one', 'two'])

df = pd.DataFrame({'col': np.arange(len(mux))}, mux)

         col
one two
a   t      0
    u      1
    v      2
    w      3
b   t      4
    u      5
    v      6
    w      7
    t      8
c   u      9
    v     10
d   w     11
    t     12
    u     13
    v     14
    w     15

You'll notice that the second level is not properly sorted.

Now, try to index a specific cross section:

df.loc[pd.IndexSlice[('c', 'u')]]
PerformanceWarning: indexing past lexsort depth may impact performance.
  # encoding: utf-8

         col
one two
c   u      9

You'll see the same behaviour with xs:

df.xs(('c', 'u'), axis=0)
PerformanceWarning: indexing past lexsort depth may impact performance.
  self.interact()

         col
one two
c   u      9
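To check for the warning programmatically rather than by eye, it can be captured with the standard `warnings` module. This is a minimal sketch that rebuilds the frame above; the helper name `warns_on_lookup` is hypothetical. As far as I can tell (this is an assumption about pandas internals, not something the docs state outright), the warning is only raised when the lookup cannot take the fast path for a full key on a unique index, so a non-unique index like this one is a convenient way to see it.

```python
import warnings

import numpy as np
import pandas as pd

# Same non-unique, partially unsorted index as in the example above.
mux = pd.MultiIndex.from_arrays(
    [list("aaaabbbbbccddddd"), list("tuvwtuvwtuvwtuvw")],
    names=["one", "two"],
)
df = pd.DataFrame({"col": np.arange(len(mux))}, index=mux)


def warns_on_lookup(frame, key):
    """Return True if frame.loc[key] emits a pandas PerformanceWarning.

    Hypothetical helper for illustration only.
    """
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning, even repeats
        frame.loc[key]
    return any(
        issubclass(w.category, pd.errors.PerformanceWarning) for w in caught
    )


unsorted_warns = warns_on_lookup(df, ("c", "u"))
sorted_warns = warns_on_lookup(df.sort_index(), ("c", "u"))
print(unsorted_warns, sorted_warns)
```

The same helper can be pointed at your own frame and key to find which lookups trigger the warning.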

The docs, backed by a timing test I once did, suggest that handling unsorted indexes imposes a slowdown: indexing takes O(N) time when it could (and should) be O(1).

If you sort the index before slicing, you'll notice the difference:

df2 = df.sort_index()
df2.loc[pd.IndexSlice[('c', 'u')]]

         col
one two
c   u      9

%timeit df.loc[pd.IndexSlice[('c', 'u')]]
%timeit df2.loc[pd.IndexSlice[('c', 'u')]]

802 µs ± 12.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
648 µs ± 20.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
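The frame above has only 16 rows, so the gap is small. Outside IPython, the same comparison can be sketched with the standard `timeit` module on a larger frame. The sizes, the names `big` and `big_sorted`, and the seed are all assumptions chosen for illustration, and the numbers will vary by machine and pandas version, so none are quoted here.

```python
import timeit
import warnings

import numpy as np
import pandas as pd

# Hypothetical data: 50,000 rows, two integer levels with many repeated keys.
rng = np.random.default_rng(0)
n = 50_000
mux = pd.MultiIndex.from_arrays(
    [rng.integers(0, 100, n), rng.integers(0, 100, n)],
    names=["one", "two"],
)
big = pd.DataFrame({"col": np.arange(n)}, index=mux)
big_sorted = big.sort_index()
key = big.index[0]  # a (one, two) tuple that is guaranteed to exist

repeats = 50
with warnings.catch_warnings():
    # Silence the PerformanceWarning so it does not flood the output.
    warnings.simplefilter("ignore", pd.errors.PerformanceWarning)
    t_unsorted = timeit.timeit(lambda: big.loc[key], number=repeats)
    t_sorted = timeit.timeit(lambda: big_sorted.loc[key], number=repeats)
    rows_unsorted = big.loc[key]
    rows_sorted = big_sorted.loc[key]

print(f"unsorted: {t_unsorted / repeats * 1e6:.0f} µs per lookup")
print(f"sorted:   {t_sorted / repeats * 1e6:.0f} µs per lookup")
```

Both lookups select the same rows; only the cost of finding them differs.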

Finally, if you want to know whether the index is sorted or not, check with MultiIndex.is_lexsorted.

df.index.is_lexsorted()
# False

df2.index.is_lexsorted()
# True
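Note that `is_lexsorted` was deprecated in later pandas releases (1.3, to the best of my knowledge) in favour of `is_monotonic_increasing`, so on a newer install the check above may warn or fail. A sketch of the equivalent check, rebuilding the same frame:

```python
import numpy as np
import pandas as pd

mux = pd.MultiIndex.from_arrays(
    [list("aaaabbbbbccddddd"), list("tuvwtuvwtuvwtuvw")],
    names=["one", "two"],
)
df = pd.DataFrame({"col": np.arange(len(mux))}, index=mux)
df2 = df.sort_index()

# is_monotonic_increasing is True only when the rows are in sorted order
# across all index levels taken together (repeated keys are allowed).
unsorted_ok = df.index.is_monotonic_increasing
sorted_ok = df2.index.is_monotonic_increasing
print(unsorted_ok, sorted_ok)  # False True
```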

As for your question on how to induce this behaviour, simply permuting the indices should suffice. This works if your index is unique:

df2 = df.loc[pd.MultiIndex.from_tuples(np.random.permutation(df2.index))] 

If your index is not unique, add a cumcounted level first,

df2 = df.set_index(
    df.groupby(level=list(range(len(df.index.levels)))).cumcount(), append=True)

# shuffle on the now-unique index, then drop the helper level
df2 = df2.loc[pd.MultiIndex.from_tuples(np.random.permutation(df2.index))]
df2 = df2.reset_index(level=-1, drop=True)
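A simpler way to scramble the rows, swapped in here as an alternative to the permutation approach above rather than taken from it, is `DataFrame.sample(frac=1)`, which shuffles whole rows and so does not care whether the index is unique. A sketch on the same example frame, with `random_state=0` chosen arbitrarily:

```python
import numpy as np
import pandas as pd

mux = pd.MultiIndex.from_arrays(
    [list("aaaabbbbbccddddd"), list("tuvwtuvwtuvwtuvw")],
    names=["one", "two"],
)
df = pd.DataFrame({"col": np.arange(len(mux))}, index=mux)

ordered = df.sort_index()                          # fully sorted index
shuffled = ordered.sample(frac=1, random_state=0)  # same rows, random order

print(ordered.index.is_monotonic_increasing)
print(shuffled.index.is_monotonic_increasing)
```

Calling `sort_index()` on the shuffled frame puts the index back in sorted order.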
cs95 answered Oct 12 '22 16:10


According to the pandas advanced indexing documentation (Sorting a MultiIndex):

On higher dimensional objects, you can sort any of the other axes by level if they have a MultiIndex

And also:

Indexing will work even if the data are not sorted, but will be rather inefficient (and show a PerformanceWarning). It will also return a copy of the data rather than a view.

According to them, you may need to ensure that indices are sorted properly.

ndrwnaguib answered Oct 12 '22 17:10