How are merges so fast in Pandas? Even when I haven't sorted on the index?

Tags:

I was merging two data sets in pandas, and wanted to speed the process up, so I sorted both of them on the column that was being used for merge. (Previously those columns had been not at all ordered.) Sorting caused no discernible speed difference; both took about eight seconds.

If I was manually merging two stacks of paper based on, say, their page number, I would first sort each of them by page number. Otherwise I'd have to do a lot of flipping back and forth between the stacks.

I wrote a test to compare the two processes. It generates two frames with a million rows each, in random order. Then it generates two more that have been sorted on the first column. Then it merges the first two, and last, merges the second two.

The data generation process was so slow that I don't have time to try more rows -- but the merge nonetheless happens in zero perceptible time, even without sorting.

import pandas as pd
import numpy as np

def shuffle(df, n=1, axis=0):
  """ https://stackoverflow.com/questions/15772009/shuffling-permutating-a-dataframe-in-pandas """
  df = df.copy()
  for _ in range(n):
    df.apply(np.random.shuffle, axis=axis)
  return df

# Create some shuffled data
df1 = pd.DataFrame({'A':range(1000000), 'B':range(1000000)})
df2 = pd.DataFrame({'A':range(1000000), 'B':range(1000000)})
df1 = shuffle(df1)
df2 = shuffle(df2)

# Sort that data on column A
df1s = df1.sort_values(by='A')
df2s = df2.sort_values(by='A')

m  = df1. merge(df2,  on='A') # Merge the unsorted data
ms = df1s.merge(df2s, on='A') # Merge the sorted data

EDIT: I wrote a test with data that's 50 times wider and 1/5 as tall, and now the sorting seems to help?

import pandas as pd
import numpy as np

def shuffle(df, n=1, axis=0):
  """ https://stackoverflow.com/questions/15772009/shuffling-permutating-a-dataframe-in-pandas """
  df = df.copy()
  for _ in range(n):
    df.apply(np.random.shuffle, axis=axis)
  return df

# Create some shuffled data
nrows = 200000
reorderedIntegers  = shuffle( pd.DataFrame({ 'A':range(nrows) }) )
reorderedIntegers2 = shuffle( pd.DataFrame({ 'A':range(nrows) }) )

# Widen it
extraColumns = pd.DataFrame( np.random.rand( nrows, 100 ) )
df  = pd.concat( [reorderedIntegers,  extraColumns], axis=1 )
df2 = pd.concat( [reorderedIntegers2, extraColumns], axis=1 )

# Create sorted copies
dfs  = df .sort_values(by='A')
dfs2 = df2.sort_values(by='A')

# Compare merge speeds
m  = df. merge(df2,  on='A') # Merge the unsorted data
ms = dfs.merge(df2s, on='A') # Merge the sorted data -- substantially faster now

P.S. I wanted to use timeit.timeit() to measure the speed of the two routines, but I kept getting an error like the following:

>>> timeit.timeit( "ms = df1s.merge(df2s, on='A')" )
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/lib/python3.6/timeit.py", line 233, in timeit
    return Timer(stmt, setup, timer, globals).timeit(number)
  File "/opt/conda/lib/python3.6/timeit.py", line 178, in timeit
    timing = self.inner(it, self.timer)
  File "<timeit-src>", line 6, in inner
NameError: name 'df1s' is not defined

309

asked Mar 20 '18 20:03

Jeffrey Benjamin Brown

1 Answers

For starters, a pandas DataFrame is not implemented as a simple multidimensional array. In the code it describes the object as a:

Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects.

Which is pretty complex and I don't expect anyone to wrap their head around that right away.

What this mentions is that it 'can be though of' as a dictionary like object. This mean that it may use some sort of hash map, implying a constant lookup time.

Merging a hash maps is not comparable to merging arrays (which is what merging 2 stacks of papers is doing) as the backend structure is completely different. Therefore sorting would not make a difference.

Unfortunately the connection between a DataFrame and a hash map is not perfect. Hash maps typically are unsorted and cannot have duplicate entries, neither of which match the DataFrame object implementation.

The other possibility, which seems more likely from looking into the code, is that since the merge opperation does not assume that the column is sorted, it goes ahead and sorts the columns itself before applying a more array-like merge. This means that pre-sorting will not make a difference as merging will re-sort the column anyways.

The pandas DataFrame object code can be found here and the merge opperation for the DataFrame's merge opperation can be found here.

154

answered Nov 02 '22 17:11

Elijah Moreau-Arnott

Related questions
                            
                                parallel assignment performance in Ruby
                            
                                SAS: Individual Table Titles for PROC FREQ or PROC REPORT?
                            
                                Parsing large amount of dates with pandas - scalability - performance drops faster than linear
                            
                                Azure Stream Analytics is too slow - also time values are irrelavant
                            
                                Visual Studio C++ compiler generates 3x slower code when changing completely unrelated code
                            
                                Is python smart enough to replace function calls with constant result?
                            
                                What are some good guidelines for choosing the size of integer types?
                            
                                Efficient way to find manager's manager's id
                            
                                Prevent huge amount of xhr/ajax calls on Facebook embed iframe
                            
                                A better method to implement the alpha synapse function in MATLAB
                            
                                Jmeter : How to increment a value inside the request
                            
                                Code optimization subset sum
                            
                                Why do these goroutines not scale their performance from more concurrent executions?
                            
                                Any way to monitor Postgresql query changes in realtime using LISTEN & NOTIFY (or NodeJS)?
                            
                                Can I optimise this further so that it runs faster?
                            
                                Performance dynamic vs Reflection
                            
                                Slow sendmail performance (javamail) with different mail api jars [duplicate]
                            
                                Is it worth idiomatic programming? An ES6 example [closed]
                            
                                Dictionary with tuple key slower than nested dictionary. Why?
                            
                                Time and Space complexity for calculating many fish are alive? [closed]

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How are merges so fast in Pandas? Even when I haven't sorted on the index?

Tags:

performance

merge

pandas

Jeffrey Benjamin Brown

People also ask

1 Answers

Elijah Moreau-Arnott

Recent Activity

Donate For Us