Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How are merges so fast in Pandas? Even when I haven't sorted on the index?

I was merging two data sets in pandas, and wanted to speed the process up, so I sorted both of them on the column that was being used for merge. (Previously those columns had been not at all ordered.) Sorting caused no discernible speed difference; both took about eight seconds.

If I was manually merging two stacks of paper based on, say, their page number, I would first sort each of them by page number. Otherwise I'd have to do a lot of flipping back and forth between the stacks.

I wrote a test to compare the two processes. It generates two frames with a million rows each, in random order. Then it generates two more that have been sorted on the first column. Then it merges the first two, and last, merges the second two.

The data generation process was so slow that I don't have time to try more rows -- but the merge nonetheless happens in zero perceptible time, even without sorting.

import pandas as pd
import numpy as np

def shuffle(df, n=1, axis=0):
  """ https://stackoverflow.com/questions/15772009/shuffling-permutating-a-dataframe-in-pandas """
  df = df.copy()
  for _ in range(n):
    df.apply(np.random.shuffle, axis=axis)
  return df

# Create some shuffled data
df1 = pd.DataFrame({'A':range(1000000), 'B':range(1000000)})
df2 = pd.DataFrame({'A':range(1000000), 'B':range(1000000)})
df1 = shuffle(df1)
df2 = shuffle(df2)

# Sort that data on column A
df1s = df1.sort_values(by='A')
df2s = df2.sort_values(by='A')

m  = df1. merge(df2,  on='A') # Merge the unsorted data
ms = df1s.merge(df2s, on='A') # Merge the sorted data

EDIT: I wrote a test with data that's 50 times wider and 1/5 as tall, and now the sorting seems to help?

import pandas as pd
import numpy as np

def shuffle(df, n=1, axis=0):
  """ https://stackoverflow.com/questions/15772009/shuffling-permutating-a-dataframe-in-pandas """
  df = df.copy()
  for _ in range(n):
    df.apply(np.random.shuffle, axis=axis)
  return df

# Create some shuffled data
nrows = 200000
reorderedIntegers  = shuffle( pd.DataFrame({ 'A':range(nrows) }) )
reorderedIntegers2 = shuffle( pd.DataFrame({ 'A':range(nrows) }) )

# Widen it
extraColumns = pd.DataFrame( np.random.rand( nrows, 100 ) )
df  = pd.concat( [reorderedIntegers,  extraColumns], axis=1 )
df2 = pd.concat( [reorderedIntegers2, extraColumns], axis=1 )

# Create sorted copies
dfs  = df .sort_values(by='A')
dfs2 = df2.sort_values(by='A')

# Compare merge speeds
m  = df. merge(df2,  on='A') # Merge the unsorted data
ms = dfs.merge(df2s, on='A') # Merge the sorted data -- substantially faster now

P.S. I wanted to use timeit.timeit() to measure the speed of the two routines, but I kept getting an error like the following:

>>> timeit.timeit( "ms = df1s.merge(df2s, on='A')" )
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/lib/python3.6/timeit.py", line 233, in timeit
    return Timer(stmt, setup, timer, globals).timeit(number)
  File "/opt/conda/lib/python3.6/timeit.py", line 178, in timeit
    timing = self.inner(it, self.timer)
  File "<timeit-src>", line 6, in inner
NameError: name 'df1s' is not defined
like image 309
Jeffrey Benjamin Brown Avatar asked Mar 20 '18 20:03

Jeffrey Benjamin Brown


People also ask

How is pandas merge so fast?

It generates two frames with a million rows each, in random order. Then it generates two more that have been sorted on the first column. Then it merges the first two, and last, merges the second two.

Does order matter for pandas merge?

Answer. Yes. Order of the merged dataframes will effect the order of the rows and columns of the merged dataframe. When using the merge() method, it will preserve the order of the left keys.

How do I keep index from merging pandas?

You can also use DataFrame. join() method to achieve the same thing. The join method will persist the original index. The column to join can be specified with on parameter.

Is pandas merge efficient?

A merge is also just as efficient as a join as long as: Merging is done on indexes if possible. The “on” parameter is avoided, and instead, both columns to merge on are explicitly stated using the keywords left_on, left_index, right_on, and right_index (when applicable).


1 Answers

For starters, a pandas DataFrame is not implemented as a simple multidimensional array. In the code it describes the object as a:

Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects.

Which is pretty complex and I don't expect anyone to wrap their head around that right away.

What this mentions is that it 'can be though of' as a dictionary like object. This mean that it may use some sort of hash map, implying a constant lookup time.

Merging a hash maps is not comparable to merging arrays (which is what merging 2 stacks of papers is doing) as the backend structure is completely different. Therefore sorting would not make a difference.

Unfortunately the connection between a DataFrame and a hash map is not perfect. Hash maps typically are unsorted and cannot have duplicate entries, neither of which match the DataFrame object implementation.

The other possibility, which seems more likely from looking into the code, is that since the merge opperation does not assume that the column is sorted, it goes ahead and sorts the columns itself before applying a more array-like merge. This means that pre-sorting will not make a difference as merging will re-sort the column anyways.

The pandas DataFrame object code can be found here and the merge opperation for the DataFrame's merge opperation can be found here.

like image 154
Elijah Moreau-Arnott Avatar answered Nov 02 '22 17:11

Elijah Moreau-Arnott