The contents of this post were originally meant to be a part of Pandas Merging 101, but due to the nature and size of the content required to fully do justice to this topic, it has been moved to its own QnA.
Given two simple DataFrames;
left = pd.DataFrame({'col1' : ['A', 'B', 'C'], 'col2' : [1, 2, 3]}) right = pd.DataFrame({'col1' : ['X', 'Y', 'Z'], 'col2' : [20, 30, 50]}) left col1 col2 0 A 1 1 B 2 2 C 3 right col1 col2 0 X 20 1 Y 30 2 Z 50
The cross product of these frames can be computed, and will look something like:
A 1 X 20 A 1 Y 30 A 1 Z 50 B 2 X 20 B 2 Y 30 B 2 Z 50 C 3 X 20 C 3 Y 30 C 3 Z 50
What is the most performant method of computing this result?
As you can see, the merge is faster than joins, though it is small value, but over 4000 iterations, that small value becomes a huge number, in minutes.
In Pandas, there are parameters to perform left, right, inner or outer merge and join on two DataFrames or Series. However there's no possibility as of now to perform a cross join to merge or join two methods using how="cross" parameter. # merge on that key.
Merge can be used in cases where both the left and right columns are not unique, and therefore cannot be an index. A merge is also just as efficient as a join as long as: Merging is done on indexes if possible.
Numba can be used in 2 ways with pandas: Specify the engine="numba" keyword in select pandas methods. Define your own Python function decorated with @jit and pass the underlying NumPy array of Series or Dataframe (using to_numpy() ) into the function.
Let's start by establishing a benchmark. The easiest method for solving this is using a temporary "key" column:
# pandas <= 1.1.X def cartesian_product_basic(left, right): return ( left.assign(key=1).merge(right.assign(key=1), on='key').drop('key', 1)) cartesian_product_basic(left, right)
# pandas >= 1.2 (est) left.merge(right, how="cross")
col1_x col2_x col1_y col2_y 0 A 1 X 20 1 A 1 Y 30 2 A 1 Z 50 3 B 2 X 20 4 B 2 Y 30 5 B 2 Z 50 6 C 3 X 20 7 C 3 Y 30 8 C 3 Z 50
How this works is that both DataFrames are assigned a temporary "key" column with the same value (say, 1). merge
then performs a many-to-many JOIN on "key".
While the many-to-many JOIN trick works for reasonably sized DataFrames, you will see relatively lower performance on larger data.
A faster implementation will require NumPy. Here are some famous NumPy implementations of 1D cartesian product. We can build on some of these performant solutions to get our desired output. My favourite, however, is @senderle's first implementation.
def cartesian_product(*arrays): la = len(arrays) dtype = np.result_type(*arrays) arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype) for i, a in enumerate(np.ix_(*arrays)): arr[...,i] = a return arr.reshape(-1, la)
Disclaimer
These solutions are optimised for DataFrames with non-mixed scalar dtypes. If dealing with mixed dtypes, use at your own risk!
This trick will work on any kind of DataFrame. We compute the cartesian product of the DataFrames' numeric indices using the aforementioned cartesian_product
, use this to reindex the DataFrames, and
def cartesian_product_generalized(left, right): la, lb = len(left), len(right) idx = cartesian_product(np.ogrid[:la], np.ogrid[:lb]) return pd.DataFrame( np.column_stack([left.values[idx[:,0]], right.values[idx[:,1]]])) cartesian_product_generalized(left, right) 0 1 2 3 0 A 1 X 20 1 A 1 Y 30 2 A 1 Z 50 3 B 2 X 20 4 B 2 Y 30 5 B 2 Z 50 6 C 3 X 20 7 C 3 Y 30 8 C 3 Z 50 np.array_equal(cartesian_product_generalized(left, right), cartesian_product_basic(left, right)) True
And, along similar lines,
left2 = left.copy() left2.index = ['s1', 's2', 's1'] right2 = right.copy() right2.index = ['x', 'y', 'y'] left2 col1 col2 s1 A 1 s2 B 2 s1 C 3 right2 col1 col2 x X 20 y Y 30 y Z 50 np.array_equal(cartesian_product_generalized(left, right), cartesian_product_basic(left2, right2)) True
This solution can generalise to multiple DataFrames. For example,
def cartesian_product_multi(*dfs): idx = cartesian_product(*[np.ogrid[:len(df)] for df in dfs]) return pd.DataFrame( np.column_stack([df.values[idx[:,i]] for i,df in enumerate(dfs)])) cartesian_product_multi(*[left, right, left]).head() 0 1 2 3 4 5 0 A 1 X 20 A 1 1 A 1 X 20 B 2 2 A 1 X 20 C 3 3 A 1 X 20 D 4 4 A 1 Y 30 A 1
A simpler solution not involving @senderle's cartesian_product
is possible when dealing with just two DataFrames. Using np.broadcast_arrays
, we can achieve almost the same level of performance.
def cartesian_product_simplified(left, right): la, lb = len(left), len(right) ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la,:lb]) return pd.DataFrame( np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]])) np.array_equal(cartesian_product_simplified(left, right), cartesian_product_basic(left2, right2)) True
Benchmarking these solutions on some contrived DataFrames with unique indices, we have
Do note that timings may vary based on your setup, data, and choice of cartesian_product
helper function as applicable.
Performance Benchmarking Code
This is the timing script. All functions called here are defined above.
from timeit import timeit import pandas as pd import matplotlib.pyplot as plt res = pd.DataFrame( index=['cartesian_product_basic', 'cartesian_product_generalized', 'cartesian_product_multi', 'cartesian_product_simplified'], columns=[1, 10, 50, 100, 200, 300, 400, 500, 600, 800, 1000, 2000], dtype=float ) for f in res.index: for c in res.columns: # print(f,c) left2 = pd.concat([left] * c, ignore_index=True) right2 = pd.concat([right] * c, ignore_index=True) stmt = '{}(left2, right2)'.format(f) setp = 'from __main__ import left2, right2, {}'.format(f) res.at[f, c] = timeit(stmt, setp, number=5) ax = res.div(res.min()).T.plot(loglog=True) ax.set_xlabel("N"); ax.set_ylabel("time (relative)"); plt.show()
Jump to other topics in Pandas Merging 101 to continue learning:
Merging basics - basic types of joins
Index-based joins
Generalizing to multiple DataFrames
Cross join *
* you are here
After pandas 1.2.0 merge
now have option cross
left.merge(right, how='cross')
Using itertools
product
and recreate the value in dataframe
import itertools l=list(itertools.product(left.values.tolist(),right.values.tolist())) pd.DataFrame(list(map(lambda x : sum(x,[]),l))) 0 1 2 3 0 A 1 X 20 1 A 1 Y 30 2 A 1 Z 50 3 B 2 X 20 4 B 2 Y 30 5 B 2 Z 50 6 C 3 X 20 7 C 3 Y 30 8 C 3 Z 50
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With