Why is merging dataframes in Pandas on an index more efficient (faster) than on a column?
import pandas as pd

# Dataframes share the ID column
df = pd.DataFrame({'ID': [0, 1, 2, 3, 4],
                   'Job': ['teacher', 'scientist', 'manager', 'teacher', 'nurse']})
df2 = pd.DataFrame({'ID': [2, 3, 4, 5, 6, 7, 8],
                    'Level': [12, 15, 14, 20, 21, 11, 15],
                    'Age': [33, 41, 42, 50, 45, 28, 32]})

# Move ID into the index of both dataframes
df = df.set_index('ID')
df2 = df2.set_index('ID')
Merging on the index is about 3.5 times faster than merging on the column! (Using Pandas 0.23.0)
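A comparison of this kind could be set up roughly as below. This is only a sketch: the names `left`, `right`, `left_idx` and `right_idx` and the repeat count are assumptions made for illustration, and the measured ratio will vary with the Pandas version, the machine and the size of the data.

```python
import timeit

import pandas as pd

# Same data as above, kept under separate (hypothetical) names so both
# the column-based and the index-based versions are available at once.
left = pd.DataFrame({'ID': [0, 1, 2, 3, 4],
                     'Job': ['teacher', 'scientist', 'manager', 'teacher', 'nurse']})
right = pd.DataFrame({'ID': [2, 3, 4, 5, 6, 7, 8],
                      'Level': [12, 15, 14, 20, 21, 11, 15],
                      'Age': [33, 41, 42, 50, 45, 28, 32]})
left_idx = left.set_index('ID')
right_idx = right.set_index('ID')

# Both calls perform an inner join on ID.
on_column = left.merge(right, on='ID')
on_index = left_idx.merge(right_idx, left_index=True, right_index=True)

# Time each variant; the repeat count is arbitrary.
repeats = 200
t_column = timeit.timeit(lambda: left.merge(right, on='ID'), number=repeats)
t_index = timeit.timeit(
    lambda: left_idx.merge(right_idx, left_index=True, right_index=True),
    number=repeats)

print(f"merge on column: {t_column / repeats * 1e6:.1f} microseconds per call")
print(f"merge on index:  {t_index / repeats * 1e6:.1f} microseconds per call")
```

Note that the `set_index` calls happen outside the timed section, so the cost of building the index is not counted here.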
The Pandas internals page says that an Index "Populates a dict of label to location in Cython to do O(1) lookups." Does this mean that operations on an index are more efficient than operations on columns? Is it a best practice to always use the index for operations such as merges?
I read through the documentation for joining and merging and it doesn't explicitly mention any benefits to using the index.
The reason for this is that the DataFrame's index is backed by a hash table.
To merge two sets, we need to find, for each element of the first, the corresponding element in the second (if it exists). Searching is significantly faster when it is backed by a hash table: searching an unsorted list is O(N), while a lookup through a hash function is ~O(1).
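The difference between the two kinds of search can be illustrated in plain Python. This is a minimal sketch with hypothetical helper names (`find_linear`, `build_lookup`); it uses an ordinary dict to stand in for the label-to-location mapping and is not Pandas' actual implementation.

```python
def find_linear(values, target):
    """Scan an unsorted list: up to N comparisons per search."""
    for position, value in enumerate(values):
        if value == target:
            return position
    return None


def build_lookup(values):
    """Build a dict of label -> position, paid for once up front."""
    return {value: position for position, value in enumerate(values)}


ids = [2, 3, 4, 5, 6, 7, 8]
lookup = build_lookup(ids)

# Both approaches find the same position, but the dict does it
# with a single hash lookup instead of walking the list.
linear_hit = find_linear(ids, 7)
hashed_hit = lookup.get(7)

# A label that is absent costs a full scan in the list,
# and still a single lookup in the dict.
linear_miss = find_linear(ids, 0)
hashed_miss = lookup.get(0)
```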
One strategy that could make merging on columns faster would be to first build a hash table for the smaller of the two. Still, that means the merge will be slower by the time it takes to build this dict.
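That strategy can be sketched as a simple hash join over lists of row dicts. The function name `hash_join` and the row representation are assumptions chosen for illustration; this shows the general technique and makes no claim about how Pandas implements `merge` internally.

```python
def hash_join(left_rows, right_rows, key):
    """Inner-join two lists of dicts on `key` using a hash table."""
    # Build the hash table on the smaller input to keep the setup cost low.
    if len(left_rows) <= len(right_rows):
        build, probe = left_rows, right_rows
    else:
        build, probe = right_rows, left_rows

    # Setup cost: one pass over the smaller input. This is the extra
    # work that a ready-made index has already paid for.
    table = {}
    for row in build:
        table.setdefault(row[key], []).append(row)

    # Probe: one ~O(1) lookup per row of the larger input.
    merged = []
    for row in probe:
        for match in table.get(row[key], []):
            merged.append({**match, **row})
    return merged


jobs = [{'ID': 0, 'Job': 'teacher'}, {'ID': 1, 'Job': 'scientist'},
        {'ID': 2, 'Job': 'manager'}, {'ID': 3, 'Job': 'teacher'},
        {'ID': 4, 'Job': 'nurse'}]
levels = [{'ID': 2, 'Level': 12, 'Age': 33}, {'ID': 3, 'Level': 15, 'Age': 41},
          {'ID': 4, 'Level': 14, 'Age': 42}, {'ID': 5, 'Level': 20, 'Age': 50},
          {'ID': 6, 'Level': 21, 'Age': 45}, {'ID': 7, 'Level': 11, 'Age': 28},
          {'ID': 8, 'Level': 15, 'Age': 32}]

merged = hash_join(jobs, levels, 'ID')
for row in merged:
    print(row)
```

Only the rows with an ID present on both sides (2, 3 and 4) appear in the result, which matches an inner merge.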