Performant cartesian product (CROSS JOIN) with pandas

The contents of this post were originally meant to be a part of Pandas Merging 101, but due to the nature and size of the content required to fully do justice to this topic, it has been moved to its own QnA.

Given two simple DataFrames:

    import pandas as pd

    left = pd.DataFrame({'col1': ['A', 'B', 'C'], 'col2': [1, 2, 3]})
    right = pd.DataFrame({'col1': ['X', 'Y', 'Z'], 'col2': [20, 30, 50]})

    left

      col1  col2
    0    A     1
    1    B     2
    2    C     3

    right

      col1  col2
    0    X    20
    1    Y    30
    2    Z    50

The cross product of these frames can be computed, and will look something like:

    A       1      X      20
    A       1      Y      30
    A       1      Z      50
    B       2      X      20
    B       2      Y      30
    B       2      Z      50
    C       3      X      20
    C       3      Y      30
    C       3      Z      50

What is the most performant method of computing this result?

786

asked Dec 10 '18 03:12

cs95

2 Answers

Let's start by establishing a benchmark. The easiest method for solving this is using a temporary "key" column:

    # pandas <= 1.1.X
    def cartesian_product_basic(left, right):
        return (
            left.assign(key=1)
                .merge(right.assign(key=1), on='key')
                # the keyword form also works on recent pandas,
                # where the positional axis argument was removed
                .drop(columns='key'))

    cartesian_product_basic(left, right)

    # pandas >= 1.2
    left.merge(right, how="cross")

      col1_x  col2_x col1_y  col2_y
    0      A       1      X      20
    1      A       1      Y      30
    2      A       1      Z      50
    3      B       2      X      20
    4      B       2      Y      30
    5      B       2      Z      50
    6      C       3      X      20
    7      C       3      Y      30
    8      C       3      Z      50

This works by assigning both DataFrames a temporary "key" column with the same value (say, 1). merge then performs a many-to-many JOIN on "key", so every row of left is paired with every row of right.

While the many-to-many JOIN trick works for reasonably sized DataFrames, it becomes relatively slow on larger data.
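To make the mechanics concrete, here is a minimal sketch that exposes the intermediate "key" column and checks the shape of the result. It redefines the sample frames so that it runs on its own; the names keyed_left, keyed_right and crossed are illustrative only.

```python
import pandas as pd

left = pd.DataFrame({'col1': ['A', 'B', 'C'], 'col2': [1, 2, 3]})
right = pd.DataFrame({'col1': ['X', 'Y', 'Z'], 'col2': [20, 30, 50]})

# Every row gets the same key, so every left row matches every right row.
keyed_left = left.assign(key=1)
keyed_right = right.assign(key=1)

crossed = keyed_left.merge(keyed_right, on='key').drop(columns='key')

# The result always has len(left) * len(right) rows.
print(crossed.shape)  # (9, 4)
```

Because the row count is the product of the input lengths, memory use grows quickly: two frames of 10,000 rows each yield 100 million rows.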

A faster implementation will require NumPy. Several well-known NumPy implementations of the 1D cartesian product exist, and we can build on some of these performant solutions to get our desired output. My favourite, however, is @senderle's first implementation.

    import numpy as np

    def cartesian_product(*arrays):
        la = len(arrays)
        dtype = np.result_type(*arrays)
        # one slot per combination, plus a trailing axis of length la
        arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)
        for i, a in enumerate(np.ix_(*arrays)):
            arr[...,i] = a
        return arr.reshape(-1, la)
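As a quick illustration of what the helper returns, the sketch below repeats the function and applies it to two small integer arrays. The inputs are made up for the example.

```python
import numpy as np

def cartesian_product(*arrays):
    la = len(arrays)
    dtype = np.result_type(*arrays)
    arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)
    for i, a in enumerate(np.ix_(*arrays)):
        arr[..., i] = a
    return arr.reshape(-1, la)

pairs = cartesian_product(np.arange(2), np.arange(3))
print(pairs)
# [[0 0]
#  [0 1]
#  [0 2]
#  [1 0]
#  [1 1]
#  [1 2]]
```

Each row is one combination, with the last input varying fastest. That is the same row order the merge-based CROSS JOIN produces.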

Generalizing: CROSS JOIN on Unique or Non-Unique Indexed DataFrames

Disclaimer
These solutions are optimised for DataFrames with non-mixed scalar dtypes. If dealing with mixed dtypes, use at your own risk!

This trick will work on any kind of DataFrame. We compute the cartesian product of the DataFrames' numeric indices using the aforementioned cartesian_product, use this to reindex the DataFrames, and stack the reindexed values side by side into a new DataFrame.

    def cartesian_product_generalized(left, right):
        la, lb = len(left), len(right)
        idx = cartesian_product(np.ogrid[:la], np.ogrid[:lb])
        return pd.DataFrame(
            np.column_stack([left.values[idx[:,0]], right.values[idx[:,1]]]))

    cartesian_product_generalized(left, right)

       0  1  2   3
    0  A  1  X  20
    1  A  1  Y  30
    2  A  1  Z  50
    3  B  2  X  20
    4  B  2  Y  30
    5  B  2  Z  50
    6  C  3  X  20
    7  C  3  Y  30
    8  C  3  Z  50

    np.array_equal(cartesian_product_generalized(left, right),
                   cartesian_product_basic(left, right))
    True

The same holds for DataFrames with non-unique indices:

    left2 = left.copy()
    left2.index = ['s1', 's2', 's1']

    right2 = right.copy()
    right2.index = ['x', 'y', 'y']

    left2

       col1  col2
    s1    A     1
    s2    B     2
    s1    C     3

    right2

      col1  col2
    x    X    20
    y    Y    30
    y    Z    50

    np.array_equal(cartesian_product_generalized(left2, right2),
                   cartesian_product_basic(left2, right2))
    True

This solution can generalise to multiple DataFrames. For example,

    def cartesian_product_multi(*dfs):
        idx = cartesian_product(*[np.ogrid[:len(df)] for df in dfs])
        return pd.DataFrame(
            np.column_stack([df.values[idx[:,i]] for i,df in enumerate(dfs)]))

    cartesian_product_multi(*[left, right, left]).head()

       0  1  2   3  4  5
    0  A  1  X  20  A  1
    1  A  1  X  20  B  2
    2  A  1  X  20  C  3
    3  A  1  Y  30  A  1
    4  A  1  Y  30  B  2
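The functions above go through .values, so the result has integer column labels and, for mixed-dtype frames, object columns, which is what the disclaimer warns about. If that matters, one alternative is sketched below. It is not part of the original answer: it swaps the .values indexing for positional iloc indexing plus pd.concat, and the function name cartesian_product_keep_schema and its suffixes parameter are hypothetical. It has not been benchmarked against the functions above.

```python
import numpy as np
import pandas as pd

def cartesian_product_keep_schema(left, right, suffixes=('_x', '_y')):
    # Hypothetical helper: cross join that keeps column names and dtypes.
    la, lb = len(left), len(right)
    # Positional row indices: each left row repeated lb times,
    # the right rows cycled la times.
    ia = np.repeat(np.arange(la), lb)
    ib = np.tile(np.arange(lb), la)
    # iloc keeps each column's dtype; the suffixes keep labels unique.
    lhs = left.iloc[ia].reset_index(drop=True).add_suffix(suffixes[0])
    rhs = right.iloc[ib].reset_index(drop=True).add_suffix(suffixes[1])
    return pd.concat([lhs, rhs], axis=1)

left = pd.DataFrame({'col1': ['A', 'B', 'C'], 'col2': [1, 2, 3]})
right = pd.DataFrame({'col1': ['X', 'Y', 'Z'], 'col2': [20, 30, 50]})

out = cartesian_product_keep_schema(left, right)
print(out.columns.tolist())
# ['col1_x', 'col2_x', 'col1_y', 'col2_y']
```

Positional indexing also means the original index labels, unique or not, play no role in the result.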

Further Simplification

A simpler solution not involving @senderle's cartesian_product is possible when dealing with just two DataFrames. Using np.broadcast_arrays, we can achieve almost the same level of performance.

    def cartesian_product_simplified(left, right):
        la, lb = len(left), len(right)
        ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la,:lb])

        return pd.DataFrame(
            np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]]))

    np.array_equal(cartesian_product_simplified(left, right),
                   cartesian_product_basic(left2, right2))
    True
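To see what the index arrays look like, this small sketch prints the result of np.ogrid combined with np.broadcast_arrays for a 2-by-3 case. The sizes are arbitrary.

```python
import numpy as np

la, lb = 2, 3
# ogrid yields a (la, 1) column and a (1, lb) row; broadcasting
# expands both to the full (la, lb) grid.
ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la, :lb])

print(ia2.ravel())  # [0 0 0 1 1 1]
print(ib2.ravel())  # [0 1 2 0 1 2]

# The same index vectors can be built with np.repeat and np.tile.
assert np.array_equal(ia2.ravel(), np.repeat(np.arange(la), lb))
assert np.array_equal(ib2.ravel(), np.tile(np.arange(lb), la))
```

The first vector selects rows of the left frame and the second selects rows of the right frame, so pairing them position by position enumerates every combination.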

Performance Comparison

Benchmarking these solutions on some contrived DataFrames with unique indices, we have

[Plot: relative running time of each function against N, on log-log axes]

Do note that timings may vary based on your setup, data, and choice of cartesian_product helper function as applicable.

Performance Benchmarking Code
This is the timing script. All functions called here are defined above.

    from timeit import timeit
    import pandas as pd
    import matplotlib.pyplot as plt

    res = pd.DataFrame(
        index=['cartesian_product_basic', 'cartesian_product_generalized',
               'cartesian_product_multi', 'cartesian_product_simplified'],
        columns=[1, 10, 50, 100, 200, 300, 400, 500, 600, 800, 1000, 2000],
        dtype=float
    )

    for f in res.index:
        for c in res.columns:
            # print(f,c)
            left2 = pd.concat([left] * c, ignore_index=True)
            right2 = pd.concat([right] * c, ignore_index=True)
            stmt = '{}(left2, right2)'.format(f)
            setp = 'from __main__ import left2, right2, {}'.format(f)
            res.at[f, c] = timeit(stmt, setp, number=5)

    ax = res.div(res.min()).T.plot(loglog=True)
    ax.set_xlabel("N")
    ax.set_ylabel("time (relative)")

    plt.show()
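The script relies on "from __main__ import ...", which only works when the functions live in the top-level script. One way around this is to time a closure instead. The sketch below is an alternative under that assumption, not the original benchmark: time_cross_join is a hypothetical helper, and only the merge-based baseline is timed so that the example runs on its own.

```python
from timeit import timeit

import pandas as pd

def cartesian_product_basic(left, right):
    return (left.assign(key=1)
                .merge(right.assign(key=1), on='key')
                .drop(columns='key'))

def time_cross_join(func, left, right, number=5):
    # Timing a lambda avoids the string statement and the __main__ import.
    return timeit(lambda: func(left, right), number=number)

left = pd.DataFrame({'col1': ['A', 'B', 'C'], 'col2': [1, 2, 3]})
right = pd.DataFrame({'col1': ['X', 'Y', 'Z'], 'col2': [20, 30, 50]})

for c in (1, 10, 100):
    big_left = pd.concat([left] * c, ignore_index=True)
    big_right = pd.concat([right] * c, ignore_index=True)
    elapsed = time_cross_join(cartesian_product_basic, big_left, big_right)
    print(c, elapsed)
```

Any of the other functions can be passed as func in the same way, provided it accepts two DataFrames.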

Continue Reading

Jump to other topics in Pandas Merging 101 to continue learning:

Merging basics - basic types of joins
Index-based joins
Generalizing to multiple DataFrames
Cross join *

* you are here

110

answered Sep 28 '22 14:09

cs95

As of pandas 1.2.0, merge supports how='cross':

    left.merge(right, how='cross')
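When both frames share column names, as left and right do here, merge disambiguates them with the default _x and _y suffixes. The suffixes parameter of merge lets you pick more descriptive ones. The suffix strings below are arbitrary, and the example assumes pandas 1.2 or later.

```python
import pandas as pd

left = pd.DataFrame({'col1': ['A', 'B', 'C'], 'col2': [1, 2, 3]})
right = pd.DataFrame({'col1': ['X', 'Y', 'Z'], 'col2': [20, 30, 50]})

crossed = left.merge(right, how='cross', suffixes=('_left', '_right'))
print(crossed.columns.tolist())
# ['col1_left', 'col2_left', 'col1_right', 'col2_right']
```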

Alternatively, use itertools.product and rebuild the rows in a new DataFrame:

    import itertools

    # pair every left row (as a list) with every right row
    l = list(itertools.product(left.values.tolist(), right.values.tolist()))
    # sum(x, []) concatenates the two row lists of each pair
    pd.DataFrame(list(map(lambda x : sum(x,[]),l)))

       0  1  2   3
    0  A  1  X  20
    1  A  1  Y  30
    2  A  1  Z  50
    3  B  2  X  20
    4  B  2  Y  30
    5  B  2  Z  50
    6  C  3  X  20
    7  C  3  Y  30
    8  C  3  Z  50

answered Sep 28 '22 16:09

BENY

Tags: python, merge, pandas, dataframe, numpy
