I want to find the intersection of two DataFrame indexes. One approach is to use the built-in pd.Index.intersection() method:
import pandas as pd

# Dummy data just to make an index with some common values
df = pd.read_excel('canadacities.xlsx', index_col=0)
ix = df.index
# Index A
ix_a = ix[0:500]
# Index B
ix_b = ix[200:700]
# Finding the intersection
%%timeit
common_index = ix_a.intersection(ix_b)
# 767 µs ± 10.1 µs per loop
Alternatively, I can use sets to do the same job:
%%timeit
# Alternative 2, use sets
common_index = list(set(ix_a) & set(ix_b))
# 103 µs ± 685 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
I am surprised that the built-in pandas method is so much slower. This remains true even if I go on to use the resulting index to select the common rows from a DataFrame. The pandas way of generating the index and selecting rows is slower...
%%timeit
common_index = ix_a.intersection(ix_b)
foo = df.loc[common_index, :]
# 2.81 ms ± 100 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
... than its more hand-rolled counterpart:
%%timeit
common_index = list(set(ix_a) & set(ix_b))
foo = df.loc[common_index, :]
#1.65 ms ± 7.14 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
So my question is: why is the built-in pandas method so much slower than the plain set operation?
A pandas Index is backed by a NumPy array. As such, it has worse performance characteristics for set operations than a Python set, which is optimized for exactly this kind of work: its underlying implementation is a hash map, which reduces the time complexity of checking whether a value is present to O(1).
A NumPy array, on the other hand, is optimized for fast traversal and vectorized arithmetic, so a method that is named after a set operation but implemented quite differently will never be as fast.
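To make the difference concrete, here is a minimal standalone sketch (using synthetic data, not your canadacities file) showing that a membership test is a single hash lookup on a set but a linear scan on a NumPy array; an intersection is built out of exactly such lookups:

import timeit
import numpy as np

arr = np.arange(1_000_000)
s = set(arr.tolist())
target = 999_999  # worst case for a linear scan over the array

# Membership in a set is a single O(1) hash lookup...
t_set = timeit.timeit(lambda: target in s, number=1_000)
# ...while membership in a NumPy array compares every element.
t_arr = timeit.timeit(lambda: target in arr, number=1_000)
print(f"set membership:   {t_set:.4f} s")
print(f"array membership: {t_arr:.4f} s")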
In your particular situation, the gain from .intersection() may simply be elegance: a single method call instead of an expression that is somewhat more cryptic at first glance. It also returns a proper Index rather than a plain list.
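That difference in return type is worth keeping in mind. A rough sketch (with made-up city names standing in for your data) of what each approach gives back:

import pandas as pd

ix_a = pd.Index(['Toronto', 'Montreal', 'Vancouver', 'Calgary'])
ix_b = pd.Index(['Vancouver', 'Toronto', 'Halifax'])

# .intersection() returns a pandas Index, keeping the dtype
# (and, with the default sort=False in recent pandas, the order of ix_a).
print(ix_a.intersection(ix_b))
# The set route returns a plain list whose element order is arbitrary.
print(list(set(ix_a) & set(ix_b)))

Both work fine as an indexer for df.loc, so if raw speed matters and order does not, the set version is a reasonable choice.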