From the pandas documentation, I've gathered that unique-valued indices make certain operations efficient, and that non-unique indices are occasionally tolerated.
From the outside, it doesn't look like non-unique indices are taken advantage of in any way. For example, the following ix
query is slow enough that it seems to be scanning the entire dataframe
In [23]: import numpy as np In [24]: import pandas as pd In [25]: x = np.random.randint(0, 10**7, 10**7) In [26]: df1 = pd.DataFrame({'x':x}) In [27]: df2 = df1.set_index('x', drop=False) In [28]: %timeit df2.ix[0] 1 loops, best of 3: 402 ms per loop In [29]: %timeit df1.ix[0] 10000 loops, best of 3: 123 us per loop
(I realize the two ix
queries don't return the same thing -- it's just an example that calls to ix
on a non-unique index appear much slower)
Is there any way to coax pandas into using faster lookup methods like binary search on non-unique and/or sorted indices?
The column you want to index does not need to have unique values.
Non-unique indexes are not used to enforce constraints on the tables with which they are associated. Instead, non-unique indexes are used solely to improve query performance by maintaining a sorted order of data values that are used frequently.
The unique function in pandas is used to find the unique values from a series. A series is a single column of a data frame. We can use the unique function on any possible set of elements in Python. It can be used on a series of strings, integers, tuples, or mixed elements.
Pandas keeps track of data types, indexes and performs error checking — all of which are very useful, but also slow down the calculations. NumPy doesn't do any of that, so it can perform the same calculations significantly faster. There are multiple ways to convert Pandas data to NumPy.
When index is unique, pandas use a hashtable to map key to value O(1). When index is non-unique and sorted, pandas use binary search O(logN), when index is random ordered pandas need to check all the keys in the index O(N).
You can call sort_index
method:
import numpy as np import pandas as pd x = np.random.randint(0, 200, 10**6) df1 = pd.DataFrame({'x':x}) df2 = df1.set_index('x', drop=False) df3 = df2.sort_index() %timeit df1.loc[100] %timeit df2.loc[100] %timeit df3.loc[100]
result:
10000 loops, best of 3: 71.2 µs per loop 10 loops, best of 3: 38.9 ms per loop 10000 loops, best of 3: 134 µs per loop
@HYRY said it well, but nothing says it quite like a colourful graph with timings.
Plots were generated using perfplot. Code, for your reference:
import pandas as pd import perfplot _rnd = np.random.RandomState(42) def make_data(n): x = _rnd.randint(0, 200, n) df1 = pd.DataFrame({'x':x}) df2 = df1.set_index('x', drop=False) df3 = df2.sort_index() return df1, df2, df3 perfplot.show( setup=lambda n: make_data(n), kernels=[ lambda dfs: dfs[0].loc[100], lambda dfs: dfs[1].loc[100], lambda dfs: dfs[2].loc[100], ], labels=['Unique index', 'Non-unique, unsorted index', 'Non-unique, sorted index'], n_range=[2 ** k for k in range(8, 23)], xlabel='N', logx=True, logy=True, equality_check=False)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With