I was merging two data sets in pandas, and wanted to speed the process up, so I sorted both of them on the column that was being used for merge. (Previously those columns had been not at all ordered.) Sorting caused no discernible speed difference; both took about eight seconds.
If I was manually merging two stacks of paper based on, say, their page number, I would first sort each of them by page number. Otherwise I'd have to do a lot of flipping back and forth between the stacks.
I wrote a test to compare the two processes. It generates two frames with a million rows each, in random order. Then it generates two more that have been sorted on the first column. Then it merges the first two, and last, merges the second two.
The data generation process was so slow that I don't have time to try more rows -- but the merge nonetheless happens in zero perceptible time, even without sorting.
import pandas as pd
import numpy as np
def shuffle(df, n=1, axis=0):
""" https://stackoverflow.com/questions/15772009/shuffling-permutating-a-dataframe-in-pandas """
df = df.copy()
for _ in range(n):
df.apply(np.random.shuffle, axis=axis)
return df
# Create some shuffled data
df1 = pd.DataFrame({'A':range(1000000), 'B':range(1000000)})
df2 = pd.DataFrame({'A':range(1000000), 'B':range(1000000)})
df1 = shuffle(df1)
df2 = shuffle(df2)
# Sort that data on column A
df1s = df1.sort_values(by='A')
df2s = df2.sort_values(by='A')
m = df1. merge(df2, on='A') # Merge the unsorted data
ms = df1s.merge(df2s, on='A') # Merge the sorted data
EDIT: I wrote a test with data that's 50 times wider and 1/5 as tall, and now the sorting seems to help?
import pandas as pd
import numpy as np
def shuffle(df, n=1, axis=0):
""" https://stackoverflow.com/questions/15772009/shuffling-permutating-a-dataframe-in-pandas """
df = df.copy()
for _ in range(n):
df.apply(np.random.shuffle, axis=axis)
return df
# Create some shuffled data
nrows = 200000
reorderedIntegers = shuffle( pd.DataFrame({ 'A':range(nrows) }) )
reorderedIntegers2 = shuffle( pd.DataFrame({ 'A':range(nrows) }) )
# Widen it
extraColumns = pd.DataFrame( np.random.rand( nrows, 100 ) )
df = pd.concat( [reorderedIntegers, extraColumns], axis=1 )
df2 = pd.concat( [reorderedIntegers2, extraColumns], axis=1 )
# Create sorted copies
dfs = df .sort_values(by='A')
dfs2 = df2.sort_values(by='A')
# Compare merge speeds
m = df. merge(df2, on='A') # Merge the unsorted data
ms = dfs.merge(df2s, on='A') # Merge the sorted data -- substantially faster now
P.S. I wanted to use timeit.timeit() to measure the speed of the two routines, but I kept getting an error like the following:
>>> timeit.timeit( "ms = df1s.merge(df2s, on='A')" )
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/conda/lib/python3.6/timeit.py", line 233, in timeit
return Timer(stmt, setup, timer, globals).timeit(number)
File "/opt/conda/lib/python3.6/timeit.py", line 178, in timeit
timing = self.inner(it, self.timer)
File "<timeit-src>", line 6, in inner
NameError: name 'df1s' is not defined
It generates two frames with a million rows each, in random order. Then it generates two more that have been sorted on the first column. Then it merges the first two, and last, merges the second two.
Answer. Yes. Order of the merged dataframes will effect the order of the rows and columns of the merged dataframe. When using the merge() method, it will preserve the order of the left keys.
You can also use DataFrame. join() method to achieve the same thing. The join method will persist the original index. The column to join can be specified with on parameter.
A merge is also just as efficient as a join as long as: Merging is done on indexes if possible. The “on” parameter is avoided, and instead, both columns to merge on are explicitly stated using the keywords left_on, left_index, right_on, and right_index (when applicable).
For starters, a pandas DataFrame is not implemented as a simple multidimensional array. In the code it describes the object as a:
Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects.
Which is pretty complex and I don't expect anyone to wrap their head around that right away.
What this mentions is that it 'can be though of' as a dictionary like object. This mean that it may use some sort of hash map, implying a constant lookup time.
Merging a hash maps is not comparable to merging arrays (which is what merging 2 stacks of papers is doing) as the backend structure is completely different. Therefore sorting would not make a difference.
Unfortunately the connection between a DataFrame and a hash map is not perfect. Hash maps typically are unsorted and cannot have duplicate entries, neither of which match the DataFrame object implementation.
The other possibility, which seems more likely from looking into the code, is that since the merge opperation does not assume that the column is sorted, it goes ahead and sorts the columns itself before applying a more array-like merge. This means that pre-sorting will not make a difference as merging will re-sort the column anyways.
The pandas DataFrame object code can be found here and the merge opperation for the DataFrame's merge opperation can be found here.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With