I'm indexing a large multi-index Pandas df using df.loc[(key1, key2)]. Sometimes I get a series back (as expected), but other times I get a dataframe. I'm trying to isolate the cases which cause the latter, but so far all I can see is that it's correlated with getting a "PerformanceWarning: indexing past lexsort depth may impact performance" warning.
I'd like to reproduce it to post here, but I can't generate another case that gives me the same warning. Here's my attempt:
    import numpy as np
    import pandas as pd

    def random_dates(start, end, n=10):
        start_u = start.value//10**9
        end_u = end.value//10**9
        return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')

    np.random.seed(0)
    df = pd.DataFrame(np.random.random(3255000).reshape(465000,7))  # same shape as my data
    df['date'] = random_dates(pd.to_datetime('1990-01-01'), pd.to_datetime('2018-01-01'), 465000)
    df = df.set_index([0, 'date'])
    df = df.sort_values(by=[3])  # unsort indices, just in case
    df.index.lexsort_depth
    > 0
    df.index.is_monotonic
    > False
    df.loc[(0.9987185534991936, pd.to_datetime('2012-04-16 07:04:34'))]  # no warning
So my question is: what causes this warning? How do I artificially induce it?
Sort your DataFrame's index using df.sort_index() to address the warning and improve performance.
I've actually written about this in detail in my writeup: Select rows in pandas MultiIndex DataFrame (under "Question 3").
To reproduce,
    import numpy as np
    import pandas as pd

    mux = pd.MultiIndex.from_arrays([
        list('aaaabbbbbccddddd'),
        list('tuvwtuvwtuvwtuvw')
    ], names=['one', 'two'])

    df = pd.DataFrame({'col': np.arange(len(mux))}, mux)

             col
    one two
    a   t      0
        u      1
        v      2
        w      3
    b   t      4
        u      5
        v      6
        w      7
        t      8
    c   u      9
        v     10
    d   w     11
        t     12
        u     13
        v     14
        w     15
You'll notice that the second level is not properly sorted.
Now, try to index a specific cross section:
    df.loc[pd.IndexSlice[('c', 'u')]]
    PerformanceWarning: indexing past lexsort depth may impact performance.
      # encoding: utf-8

             col
    one two
    c   u      9
You'll see the same behaviour with xs:
    df.xs(('c', 'u'), axis=0)
    PerformanceWarning: indexing past lexsort depth may impact performance.
      self.interact()

             col
    one two
    c   u      9
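If you'd rather check for the warning programmatically than spot it in the console, here is a minimal sketch using the standard library's warnings module on the same frame as above. Whether the warning actually fires may depend on your pandas version, so the sketch prints the count instead of assuming it. The names caught and perf_warnings are just illustrative.

```python
import warnings

import numpy as np
import pandas as pd

# Same frame as above: first level sorted, second level not, one duplicate key.
mux = pd.MultiIndex.from_arrays(
    [list('aaaabbbbbccddddd'), list('tuvwtuvwtuvwtuvw')],
    names=['one', 'two'])
df = pd.DataFrame({'col': np.arange(len(mux))}, index=mux)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')  # make sure nothing is suppressed
    result = df.loc[('c', 'u')]

# Keep only pandas performance warnings from whatever was recorded.
perf_warnings = [w for w in caught
                 if issubclass(w.category, pd.errors.PerformanceWarning)]
print(len(perf_warnings), 'PerformanceWarning(s) captured')
print(result)
```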
The docs, backed by this timing test I once did, seem to suggest that handling unsorted indexes imposes a slowdown: indexing is O(N) time when it could/should be O(1).
If you sort the index before slicing, you'll notice the difference:
    df2 = df.sort_index()
    df2.loc[pd.IndexSlice[('c', 'u')]]

             col
    one two
    c   u      9

    %timeit df.loc[pd.IndexSlice[('c', 'u')]]
    %timeit df2.loc[pd.IndexSlice[('c', 'u')]]

    802 µs ± 12.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    648 µs ± 20.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Finally, if you want to know whether the index is sorted or not, check with MultiIndex.is_lexsorted.
    df.index.is_lexsorted()
    # False

    df2.index.is_lexsorted()
    # True
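As far as I know, is_lexsorted was deprecated in later pandas releases and is no longer available in recent ones, so the call above may fail depending on your version. A rough equivalent, sketched here under that assumption, is the is_monotonic_increasing property, which exists on MultiIndex as well:

```python
import numpy as np
import pandas as pd

mux = pd.MultiIndex.from_arrays(
    [list('aaaabbbbbccddddd'), list('tuvwtuvwtuvwtuvw')],
    names=['one', 'two'])
df = pd.DataFrame({'col': np.arange(len(mux))}, index=mux)
df2 = df.sort_index()

# Non-strict check: duplicate keys such as ('b', 't') are allowed.
unsorted_is_monotonic = df.index.is_monotonic_increasing
sorted_is_monotonic = df2.index.is_monotonic_increasing
print(unsorted_is_monotonic, sorted_is_monotonic)
```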
As for your question on how to induce this behaviour, simply permuting the indices should suffice. This works if your index is unique:
df2 = df.loc[pd.MultiIndex.from_tuples(np.random.permutation(df2.index))]
If your index is not unique, add a cumcounted level first,
    # append a per-group counter so that every index tuple becomes unique
    df_unique = df.set_index(
        df.groupby(level=list(range(df.index.nlevels))).cumcount(), append=True)
    df2 = df_unique.loc[pd.MultiIndex.from_tuples(np.random.permutation(df_unique.index))]
    df2 = df2.reset_index(level=-1, drop=True)
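A possibly simpler way to scramble the rows, which is not the approach above but a swapped-in alternative, is DataFrame.sample with frac=1. It shuffles by position, so it should not matter whether the index is unique. The random_state value is an arbitrary choice for repeatability.

```python
import numpy as np
import pandas as pd

mux = pd.MultiIndex.from_arrays(
    [list('aaaabbbbbccddddd'), list('tuvwtuvwtuvwtuvw')],
    names=['one', 'two'])
df = pd.DataFrame({'col': np.arange(len(mux))}, index=mux)

# frac=1 returns every row exactly once, in random order.
shuffled = df.sample(frac=1, random_state=0)

print(shuffled.index.is_monotonic_increasing)  # very likely False after shuffling
print(shuffled.head())
```

The shuffled frame holds the same rows as df, only reordered, so indexing it with a full key should exercise the same unsorted code path.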
According to the pandas advanced indexing docs (Sorting a MultiIndex):
"On higher dimensional objects, you can sort any of the other axes by level if they have a MultiIndex."
And also:
"Indexing will work even if the data are not sorted, but will be rather inefficient (and show a PerformanceWarning). It will also return a copy of the data rather than a view."
According to them, you may need to ensure that indices are sorted properly.
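On the Series-versus-DataFrame part of the question, a small sketch may help. It assumes the difference comes from duplicate keys in the index, which is my reading and not something the docs quoted above state. The frames and names below are made up for illustration.

```python
import pandas as pd

unique_idx = pd.MultiIndex.from_tuples(
    [('a', 't'), ('a', 'u'), ('b', 't')], names=['one', 'two'])
dup_idx = pd.MultiIndex.from_tuples(
    [('a', 't'), ('b', 't'), ('b', 't')], names=['one', 'two'])

df_unique = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]}, index=unique_idx)
df_dup = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]}, index=dup_idx)

# Full key on a unique index: a single row comes back as a Series.
row = df_unique.loc[('a', 'u')]

# Full key that occurs twice: both matching rows come back as a DataFrame.
rows = df_dup.loc[('b', 't')]

print(type(row).__name__, type(rows).__name__)
```

If that is what is happening in your data, checking df.index.is_unique would be a quick first diagnostic.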