How to multiply every column of one Pandas Dataframe with every column of another Dataframe efficiently?

I'm trying to multiply two pandas dataframes with each other. Specifically, I want to multiply every column with every column of the other df.

The dataframes are one-hot encoded, so they look like this:

col_1, col_2, col_3, ...
 0      1      0 
 1      0      0 
 0      0      1 
 ...

I could just iterate through each of the columns using a for loop, but in Python that is computationally expensive, and I'm hoping there's an easier way.

One of the dataframes has 500 columns, the other has 100 columns.

This is the fastest version that I've been able to write so far:

interact_pd = pd.DataFrame(index=df_1.index)
df1_columns = [column for column in df_1]
for column in df_2:
    col_pd = df_1[df1_columns].multiply(df_2[column], axis="index")
    interact_pd = interact_pd.join(col_pd, lsuffix='_' + column)

I iterate over each column in df_2 and multiply all of df_1 by that column, then I append the result to interact_pd. I would rather not use a for loop, however, as it is computationally costly. Is there a faster way of doing it?

EDIT: example

df_1:

1col_1, 1col_2, 1col_3
 0      1      0 
 1      0      0 
 0      0      1

df_2:

2col_1, 2col_2
 0      1       
 1      0       
 0      0

interact_pd:

1col_1_2col_1, 1col_2_2col_1, 1col_3_2col_1, 1col_1_2col_2, 1col_2_2col_2, 1col_3_2col_2
      0              0              0              0              1              0
      1              0              0              0              0              0
      0              0              0              0              0              0

asked Aug 16 '16 05:08

chris

3 Answers

# use numpy to get a pair of indices that map out every
# combination of columns from df_1 and columns of df_2
pidx = np.indices((df_1.shape[1], df_2.shape[1])).reshape(2, -1)

# use pandas MultiIndex to create a nice MultiIndex for
# the final output
lcol = pd.MultiIndex.from_product([df_1.columns, df_2.columns],
                                  names=[df_1.columns.name, df_2.columns.name])

# df_1.values[:, pidx[0]] slices df_1 values for every combination
# likewise with df_2.values[:, pidx[1]]
# finally, I marry up the product of arrays with the MultiIndex
pd.DataFrame(df_1.values[:, pidx[0]] * df_2.values[:, pidx[1]],
             columns=lcol)
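
To make the index trick concrete, here is a minimal sketch that applies it to the small example frames from the question. The names `pairwise` and `flat` are only illustrative, and the last step (joining each column pair into a single string) is an assumption about the naming the asker wants.

```python
import numpy as np
import pandas as pd

# The example frames from the question.
df_1 = pd.DataFrame([[0, 1, 0], [1, 0, 0], [0, 0, 1]],
                    columns=["1col_1", "1col_2", "1col_3"])
df_2 = pd.DataFrame([[0, 1], [1, 0], [0, 0]],
                    columns=["2col_1", "2col_2"])

# Row 0 holds df_1 column positions, row 1 holds df_2 column positions.
pidx = np.indices((df_1.shape[1], df_2.shape[1])).reshape(2, -1)
print(pidx)
# [[0 0 1 1 2 2]
#  [0 1 0 1 0 1]]

# One output column per (df_1 column, df_2 column) pair.
lcol = pd.MultiIndex.from_product([df_1.columns, df_2.columns])
pairwise = pd.DataFrame(df_1.values[:, pidx[0]] * df_2.values[:, pidx[1]],
                        columns=lcol, index=df_1.index)

# Optional (illustrative): flatten the MultiIndex into "1col_x_2col_y" names.
flat = pairwise.copy()
flat.columns = ["_".join(pair) for pair in pairwise.columns]
print(flat)
```

Note that the columns come out grouped by df_1 column (1col_1_2col_1, 1col_1_2col_2, ...), which is a different order from the interact_pd layout in the question, although the values for each pair are the same.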

Timing

code

from string import ascii_letters

import numpy as np
import pandas as pd

df_1 = pd.DataFrame(np.random.randint(0, 2, (1000, 26)), columns=list(ascii_letters[:26]))
df_2 = pd.DataFrame(np.random.randint(0, 2, (1000, 52)), columns=list(ascii_letters))

def pir1(df_1, df_2):
    pidx = np.indices((df_1.shape[1], df_2.shape[1])).reshape(2, -1)

    lcol = pd.MultiIndex.from_product([df_1.columns, df_2.columns],
                                      names=[df_1.columns.name, df_2.columns.name])

    return pd.DataFrame(df_1.values[:, pidx[0]] * df_2.values[:, pidx[1]],
                        columns=lcol)

def Test2(DA,DB):
  # as_matrix() was removed in pandas 1.0; to_numpy() is the replacement
  MA = DA.to_numpy()
  MB = DB.to_numpy()
  MM = np.zeros((len(MA),len(MA[0])*len(MB[0])))
  Col = []
  for i in range(len(MB[0])):
    for j in range(len(MA[0])):
      MM[:,i*len(MA[0])+j] = MA[:,j]*MB[:,i]
      Col.append('1col_'+str(i+1)+'_2col_'+str(j+1))
  return pd.DataFrame(MM,dtype=int,columns=Col)

results

[Image: timing results for pir1 and Test2]

answered Nov 14 '22 22:11

piRSquared

You can multiply your first df along the index axis by each column of the second df. This is the fastest method for big datasets (see below):

# items() replaces iteritems(), which was removed in pandas 2.0
df = pd.concat([df_1.mul(col[1], axis="index") for col in df_2.items()], axis=1)
# Change the name of the columns
df.columns = ["_".join([i, j]) for j in df_2.columns for i in df_1.columns]
df
       1col_1_2col_1  1col_2_2col_1  1col_3_2col_1  1col_1_2col_2  \
0                  0              0              0              0   
1                  1              0              0              0   
2                  0              0              0              0   

   1col_2_2col_2  1col_3_2col_2  
0              1              0  
1              0              0  
2              0              0

--> See benchmark for comparisons with other answers to choose the best option for your dataset.

Benchmark

Functions:

def Test2(DA,DB):
  # as_matrix() was removed in pandas 1.0; to_numpy() is the replacement
  MA = DA.to_numpy()
  MB = DB.to_numpy()
  MM = np.zeros((len(MA),len(MA[0])*len(MB[0])))
  Col = []
  for i in range(len(MB[0])):
    for j in range(len(MA[0])):
      MM[:,i*len(MA[0])+j] = MA[:,j]*MB[:,i]
      Col.append('1col_'+str(i+1)+'_2col_'+str(j+1))
  return pd.DataFrame(MM,dtype=int,columns=Col)

def Test3(df_1, df_2):
    # items() replaces iteritems(), which was removed in pandas 2.0
    df = pd.concat([df_1.mul(i[1], axis="index") for i in df_2.items()], axis=1)
    df.columns = ["_".join([i,j]) for j in df_2.columns for i in df_1.columns]
    return df

def Test4(df_1,df_2):
    pidx = np.indices((df_1.shape[1], df_2.shape[1])).reshape(2, -1)
    lcol = pd.MultiIndex.from_product([df_1.columns, df_2.columns],
                                      names=[df_1.columns.name, df_2.columns.name])
    return pd.DataFrame(df_1.values[:, pidx[0]] * df_2.values[:, pidx[1]],
                 columns=lcol)

def jeanrjc_imp(df_1, df_2):
    # keys= labels each block with its df_2 column, giving MultiIndex columns
    df = pd.concat([df_1.mul(i[1], axis="index") for i in df_2.items()], axis=1, keys=df_2.columns)
    return df

Code:

Sorry, ugly code, the plot at the end matters :

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

df_1 = pd.DataFrame(np.random.randint(0, 2, (1000, 600)))
df_2 = pd.DataFrame(np.random.randint(0, 2, (1000, 600)))
df_1.columns = ["1col_"+str(i) for i in range(len(df_1.columns))]
df_2.columns = ["2col_"+str(i) for i in range(len(df_2.columns))]
resa = {}
resb = {}
resc = {}
# Note: %timeit is an IPython magic, so this loop must run in IPython or Jupyter.
for f, r in zip([Test2, Test3, Test4, jeanrjc_imp], ["T2", "T3", "T4", "T3bis"]):
    resa[r] = []
    resb[r] = []
    resc[r] = []
    for i in [5, 10, 30, 50, 150, 200]:
        a = %timeit -o f(df_1.iloc[:, :i], df_2.iloc[:, :10])
        b = %timeit -o f(df_1.iloc[:, :i], df_2.iloc[:, :50])
        c = %timeit -o f(df_1.iloc[:, :i], df_2.iloc[:, :200])
        resa[r].append(a.best)
        resb[r].append(b.best)
        resc[r].append(c.best)

X = [5, 10, 30, 50, 150, 200]
fig, ax = plt.subplots(1, 3, figsize=[16,5])
for j, (a, r) in enumerate(zip(ax, [resa, resb, resc])):
    for i in r:
        a.plot(X, r[i], label=i)
        a.set_xlabel("df_1 columns #") 
        a.set_title("df_2 columns # = {}".format(["10", "50", "200"][j]))
ax[0].set_ylabel("time(s)")
plt.legend(loc=0)
plt.tight_layout()

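[Figure: Pandas column multiplication benchmark. Three panels (df_2 columns # = 10, 50, 200), x-axis "df_1 columns #", y-axis "time(s)", one line per function.]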

Here T3bis <=> jeanrjc_imp, which is a bit faster than Test3.

Conclusion:

Depending on your dataset size, pick the right function, between Test4 and Test3(b). Given the OP's dataset, Test3 or jeanrjc_imp should be the fastest, and also the shortest to write!
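
One thing to keep in mind when swapping between these functions: the concat-based versions lay the columns out per df_2 column, while Test4 lays them out per df_1 column. The sketch below is only a sanity check on small random frames (the sizes, the seed, and the names `by_concat`, `by_index` and `aligned` are arbitrary choices, not part of the benchmark); it suggests the values agree once the column levels are swapped and reordered.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df_1 = pd.DataFrame(rng.integers(0, 2, (6, 3)),
                    columns=["1col_0", "1col_1", "1col_2"])
df_2 = pd.DataFrame(rng.integers(0, 2, (6, 2)),
                    columns=["2col_0", "2col_1"])

# concat-based approach (jeanrjc_imp style): one block per df_2 column,
# so the outer column level is the df_2 name.
by_concat = pd.concat([df_1.mul(col, axis="index") for _, col in df_2.items()],
                      axis=1, keys=df_2.columns)

# index-based approach (Test4 style): the outer column level is the df_1 name.
pidx = np.indices((df_1.shape[1], df_2.shape[1])).reshape(2, -1)
lcol = pd.MultiIndex.from_product([df_1.columns, df_2.columns])
by_index = pd.DataFrame(df_1.values[:, pidx[0]] * df_2.values[:, pidx[1]],
                        columns=lcol, index=df_1.index)

# Swap the column levels of the concat result and reorder to match by_index.
aligned = by_concat.swaplevel(axis=1).reindex(columns=by_index.columns)
same = np.array_equal(aligned.to_numpy(), by_index.to_numpy())
print(same)
```

So the choice between the two families is about speed and column layout, not about the numbers they produce.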

HTH

answered Nov 14 '22 21:11

jrjc

You can use numpy.

Consider this example code. I modified the variable names, but Test1() is essentially your code. I didn't bother to create the correct column names in that function, though:

import pandas as pd
import numpy as np

A = [[1,0,1,1],[0,1,1,0],[0,1,0,1]]
B = [[0,0,1,0],[1,0,1,0],[1,1,0,0],[1,0,0,1],[1,0,0,0]]

DA = pd.DataFrame(A).T
DB = pd.DataFrame(B).T

def Test1(DA,DB):
  E = pd.DataFrame(index=DA.index)
  DAC = [column for column in DA]
  for column in DB:
    C = DA[DAC].multiply(DB[column], axis="index")
    E = E.join(C, lsuffix='_' + str(column))
  return E

def Test2(DA,DB):
  # as_matrix() was removed in pandas 1.0; to_numpy() is the replacement
  MA = DA.to_numpy()
  MB = DB.to_numpy()
  MM = np.zeros((len(MA),len(MA[0])*len(MB[0])))
  Col = []
  for i in range(len(MB[0])):
    for j in range(len(MA[0])):
      MM[:,i*len(MA[0])+j] = MA[:,j]*MB[:,i]
      Col.append('1col_'+str(i+1)+'_2col_'+str(j+1))
  return pd.DataFrame(MM,dtype=int,columns=Col)

print(Test1(DA,DB))
print(Test2(DA,DB))

Output:

   0_1  1_1  2_1  0  1  2  0_3  1_3  2_3  0  1  2  0  1  2
0    0    0    0  1  0  0    1    0    0  1  0  0  1  0  0
1    0    0    0  0  0  0    0    1    1  0  0  0  0  0  0
2    1    1    0  1  1  0    0    0    0  0  0  0  0  0  0
3    0    0    0  0  0  0    0    0    0  1  0  1  0  0  0
   1col_1_2col_1  1col_1_2col_2  1col_1_2col_3  1col_2_2col_1  1col_2_2col_2  \
0              0              0              0              1              0   
1              0              0              0              0              0   
2              1              1              0              1              1   
3              0              0              0              0              0   

   1col_2_2col_3  1col_3_2col_1  1col_3_2col_2  1col_3_2col_3  1col_4_2col_1  \
0              0              1              0              0              1   
1              0              0              1              1              0   
2              0              0              0              0              0   
3              0              0              0              0              1   

   1col_4_2col_2  1col_4_2col_3  1col_5_2col_1  1col_5_2col_2  1col_5_2col_3  
0              0              0              1              0              0  
1              0              0              0              0              0  
2              0              0              0              0              0  
3              0              1              0              0              0

Performance of your function:

%timeit(Test1(DA,DB))
100 loops, best of 3: 11.1 ms per loop

Performance of my function:

%timeit(Test2(DA,DB))
1000 loops, best of 3: 464 µs per loop

It's not beautiful, but it's efficient.
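
If the double loop bothers you, the same array can probably be built in one step with NumPy broadcasting. This is a different technique from Test2, not a rewrite of it: a minimal sketch where the function name `pairwise_broadcast` is made up for illustration and the column labels are left out.

```python
import numpy as np
import pandas as pd

def pairwise_broadcast(DA, DB):
    """Multiply every column of DB with every column of DA, row by row."""
    MA = DA.to_numpy()
    MB = DB.to_numpy()
    # Shape (rows, n_B, n_A): entry [r, i, j] is MB[r, i] * MA[r, j].
    prod = MB[:, :, None] * MA[:, None, :]
    # Flatten the last two axes so that column i * n_A + j holds
    # MA[:, j] * MB[:, i], the same layout Test2 fills in its loops.
    return prod.reshape(len(MA), -1)

A = [[1,0,1,1],[0,1,1,0],[0,1,0,1]]
B = [[0,0,1,0],[1,0,1,0],[1,1,0,0],[1,0,0,1],[1,0,0,0]]

DA = pd.DataFrame(A).T
DB = pd.DataFrame(B).T

result = pairwise_broadcast(DA, DB)
print(result.shape)
```

The intermediate 3-D array has as many elements as the final result, so memory use should be comparable to Test2. I have not timed this version against the functions above.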

answered Nov 14 '22 23:11

Khris
