
Performance: Pandas index.intersection() vs set intersection

I want to find the intersection of two dataframe indexes. One approach would be to use the built-in pd.Index.intersection() method, as in:

import pandas as pd

# Dummy data just to make an index with some common values
ix = pd.read_excel('canadacities.xlsx', index_col=0).index

# Index A
ix_a = ix[0:500]

# Index B
ix_b = ix[200:700]

%%timeit
# Finding the intersection (the %%timeit cell magic must be the first line of its cell)
common_index = ix_a.intersection(ix_b)
# 767 µs ± 10.1 µs per loop

Alternatively, I can use sets to do the same job:

%%timeit

# Alternative 2, use sets
common_index = list(set(ix_a) & set(ix_b))
# 103 µs ± 685 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

I am surprised that the built-in pandas method proves so much slower. This holds even when I then use the resulting index to select the common rows of a dataframe. The pandas index intersection and selection is slower...

%%timeit
common_index = ix_a.intersection(ix_b)
foo = df.loc[common_index, :]
# 2.81 ms ± 100 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

... than its more custom counterpart

%%timeit
common_index = list(set(ix_a) & set(ix_b))
foo = df.loc[common_index, :]
# 1.65 ms ± 7.14 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
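
For readers who do not have the spreadsheet, the index comparison above can be reproduced roughly as follows. This is only a sketch: the `city_0000`-style labels are a synthetic stand-in for the real index (an assumption), the helper names `with_pandas` and `with_sets` are made up for illustration, and the standard-library `timeit` module replaces the `%%timeit` cell magic so that it runs outside a notebook. The absolute numbers will differ from the ones quoted here.

```python
import timeit

import pandas as pd

# Synthetic stand-in for the labels in the spreadsheet (an assumption).
labels = [f"city_{i:04d}" for i in range(700)]
ix = pd.Index(labels)

ix_a = ix[0:500]
ix_b = ix[200:700]


def with_pandas():
    """Intersection through the pandas Index method."""
    return ix_a.intersection(ix_b)


def with_sets():
    """Intersection through plain Python sets."""
    return list(set(ix_a) & set(ix_b))


# Both approaches should yield the same 300 labels (positions 200 to 499).
assert set(with_pandas()) == set(with_sets())
assert len(with_sets()) == 300

# Timings depend on the machine, the pandas version, and the index dtype.
n = 200
t_pandas = timeit.timeit(with_pandas, number=n) / n
t_sets = timeit.timeit(with_sets, number=n) / n
print(f"Index.intersection: {t_pandas * 1e6:.1f} µs per call")
print(f"set intersection:   {t_sets * 1e6:.1f} µs per call")
```

With only a few hundred labels, fixed per-call overhead can make up a large share of either timing, so it may be worth rerunning the sketch with larger slices before drawing conclusions.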

So my questions would be:

  1. Why is the pandas internal method slower?
  2. Is there some compensating advantage to the slower method?
billjoie asked Sep 11 '25 04:09


1 Answer

A pandas Index is backed by a NumPy array. As such, it has worse performance characteristics for set operations than a Python set, which is optimized for exactly this kind of operation: its underlying implementation is a hash table, which reduces the average time complexity of checking whether a value is in the set to O(1).
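
The complexity argument can be illustrated with a small pure-Python sketch. It does not show how pandas implements Index.intersection; it only contrasts membership tests against a hash-based set with membership tests against a flat sequence. Both function names are hypothetical.

```python
def intersect_with_hash_lookup(a, b):
    """Intersect by probing a hash set: roughly len(a) + len(b) steps."""
    lookup = set(b)  # one pass over b builds the hash table
    return [x for x in a if x in lookup]  # average O(1) per membership test


def intersect_with_linear_scan(a, b):
    """Intersect by scanning b for each element of a: roughly len(a) * len(b) steps."""
    b_list = list(b)
    return [x for x in a if x in b_list]  # O(len(b)) per membership test


a = list(range(0, 500))
b = list(range(200, 700))

fast = intersect_with_hash_lookup(a, b)
slow = intersect_with_linear_scan(a, b)

# Same result either way; only the amount of work differs.
assert fast == slow == list(range(200, 500))
```

Both functions keep the order of `a`, which makes the results directly comparable; the expression `set(a) & set(b)` gives the same elements without any ordering guarantee.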

A NumPy array, by contrast, is optimized for fast traversal, so an operation that is named after a set operation but is carried out in a very different way will not be as fast as the same operation on a real set.

In your particular situation, the gain may lie in the elegance of calling a single method instead of writing an expression that is somewhat more cryptic at first glance.
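
Beyond elegance, one practical difference is the kind of object each approach returns. The sketch below uses made-up labels and an index name of "city" (both assumptions for illustration) to show that the method returns a pandas Index, while the set expression returns a plain Python list.

```python
import pandas as pd

# Hypothetical labels; the index name "city" is an assumption for illustration.
ix_a = pd.Index(["Toronto", "Montreal", "Vancouver", "Calgary"], name="city")
ix_b = pd.Index(["Calgary", "Vancouver", "Ottawa"], name="city")

via_pandas = ix_a.intersection(ix_b)    # a pandas Index, name carried over
via_sets = list(set(ix_a) & set(ix_b))  # a plain list, no name attached

print(type(via_pandas).__name__, via_pandas.name)
print(type(via_sets).__name__)

# Same elements either way; only the container differs.
assert set(via_pandas) == set(via_sets) == {"Vancouver", "Calgary"}
```

A list built from a set comes back in arbitrary order, which carries over to the rows selected with `df.loc`. `Index.intersection` also accepts a `sort` argument for cases where the order of the result matters.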

sophros answered Sep 13 '25 17:09