
Merging dataframes on an index is more efficient in Pandas

Why is merging dataframes in Pandas on an index more efficient (faster) than on a column?

import pandas as pd

# Dataframes share the ID column
df = pd.DataFrame({'ID': [0, 1, 2, 3, 4],
                   'Job': ['teacher', 'scientist', 'manager', 'teacher', 'nurse']})

df2 = pd.DataFrame({'ID': [2, 3, 4, 5, 6, 7, 8],
                    'Level': [12, 15, 14, 20, 21, 11, 15], 
                    'Age': [33, 41, 42, 50, 45, 28, 32]})

[Image not preserved: timing of the merge on the ID column]

df = df.set_index('ID')
df2 = df2.set_index('ID')

[Image not preserved: timing of the merge on the index]

Merging on the index is about 3.5 times faster than merging on the column! (Using Pandas 0.23.0)
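A rough way to reproduce the comparison is sketched below with `timeit`. The original screenshots are not available, so the exact calls being timed are an assumption: a column merge with `on='ID'` and an index merge with `left_index=True, right_index=True`. The names `left`, `right`, `merge_on_column` and `merge_on_index` are illustrative only, and the absolute numbers will depend on the machine and the Pandas version.

```python
import timeit

import pandas as pd

# Same data as in the question, under different (hypothetical) names.
left = pd.DataFrame({'ID': [0, 1, 2, 3, 4],
                     'Job': ['teacher', 'scientist', 'manager', 'teacher', 'nurse']})
right = pd.DataFrame({'ID': [2, 3, 4, 5, 6, 7, 8],
                      'Level': [12, 15, 14, 20, 21, 11, 15],
                      'Age': [33, 41, 42, 50, 45, 28, 32]})

# Indexed copies, built once outside the timed functions.
left_idx = left.set_index('ID')
right_idx = right.set_index('ID')


def merge_on_column():
    # Assumed column-based merge: join on the shared ID column.
    return left.merge(right, on='ID')


def merge_on_index():
    # Assumed index-based merge: join on the ID index of both frames.
    return left_idx.merge(right_idx, left_index=True, right_index=True)


n = 200  # number of repetitions per measurement
t_col = timeit.timeit(merge_on_column, number=n)
t_idx = timeit.timeit(merge_on_index, number=n)
print(f"column merge: {t_col / n * 1e6:.0f} us per call")
print(f"index merge:  {t_idx / n * 1e6:.0f} us per call")
```

Note that `set_index` is done outside the timed functions here, so the cost of building the index is not counted in the index-based measurement.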

According to the Pandas internals page, an Index "Populates a dict of label to location in Cython to do O(1) lookups." Does this mean that operations on an index are more efficient than operations on columns? Is it a best practice to always use the index for operations such as merges?

I read through the documentation for joining and merging and it doesn't explicitly mention any benefits to using the index.

asked Jun 21 '18 at 14:06 by willk




1 Answer

The reason for this is that the DataFrame's index is backed by a hash table.

To merge two sets, we need to find, for each element of the first, the corresponding element in the second (if it exists). Searching is significantly faster with a hash table: finding an element in an unsorted list is O(N), while a lookup in a hash table is ~O(1) on average.
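The difference between the two kinds of search can be illustrated in plain Python. This is only a sketch of the general idea, not of Pandas internals: a Python `dict` stands in for the hash table, and the names `keys`, `position`, `scan` and `hashed` are made up for the example.

```python
import timeit

n = 100_000
keys = list(range(n))                               # unsorted-list stand-in
position = {key: i for i, key in enumerate(keys)}   # hash table: key -> location

target = n - 1  # last element, the worst case for a linear scan


def scan():
    # O(N): walks the list until it finds the target.
    return keys.index(target)


def hashed():
    # ~O(1): a single hash lookup.
    return position[target]


t_scan = timeit.timeit(scan, number=100)
t_hash = timeit.timeit(hashed, number=100)
print(f"list scan:   {t_scan:.5f} s for 100 lookups")
print(f"hash lookup: {t_hash:.5f} s for 100 lookups")
```

Both functions return the same location; only the time needed to find it differs, and the gap grows with `n`.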

One strategy that could speed up a merge on columns would be to first build a hash table for the smaller of the two. Even so, the merge would still be slower by the time it takes to build this dict.
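The strategy of hashing the smaller side first can be sketched as a small hash join over lists of dicts. This is an illustration of the technique only, not a description of how Pandas implements `merge`; the function `hash_join` and the variables `jobs` and `levels` are hypothetical names, with the data taken from the question.

```python
def hash_join(left_rows, right_rows, key):
    """Inner-join two lists of dicts on `key`, hashing the smaller list."""
    # Build phase: hash the smaller input so the table is cheap to create.
    if len(left_rows) <= len(right_rows):
        build, probe, build_is_left = left_rows, right_rows, True
    else:
        build, probe, build_is_left = right_rows, left_rows, False

    table = {}
    for row in build:
        # Keep a list per key so duplicate keys are handled.
        table.setdefault(row[key], []).append(row)

    # Probe phase: one ~O(1) lookup per row of the larger input.
    joined = []
    for row in probe:
        for match in table.get(row[key], []):
            left, right = (match, row) if build_is_left else (row, match)
            joined.append({**left, **right})
    return joined


jobs = [{'ID': 0, 'Job': 'teacher'}, {'ID': 1, 'Job': 'scientist'},
        {'ID': 2, 'Job': 'manager'}, {'ID': 3, 'Job': 'teacher'},
        {'ID': 4, 'Job': 'nurse'}]
levels = [{'ID': 2, 'Level': 12, 'Age': 33}, {'ID': 3, 'Level': 15, 'Age': 41},
          {'ID': 4, 'Level': 14, 'Age': 42}, {'ID': 5, 'Level': 20, 'Age': 50},
          {'ID': 6, 'Level': 21, 'Age': 45}, {'ID': 7, 'Level': 11, 'Age': 28},
          {'ID': 8, 'Level': 15, 'Age': 32}]

result = hash_join(jobs, levels, 'ID')
for row in result:
    print(row)
```

The build loop is the extra cost mentioned above: it has to run on every merge on a column, whereas an index can reuse a lookup table it has already built.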

answered Sep 16 '22 at 11:09 by ntg