
How to multiply every column of one Pandas Dataframe with every column of another Dataframe efficiently?

I'm trying to multiply two pandas dataframes with each other. Specifically, I want to multiply every column with every column of the other df.

The dataframes are one-hot encoded, so they look like this:

col_1, col_2, col_3, ...
 0      1      0 
 1      0      0 
 0      0      1 
 ...

I could just iterate through each of the columns using a for loop, but in Python that is computationally expensive, and I'm hoping there's an easier way.

One of the dataframes has 500 columns, the other has 100 columns.

This is the fastest version that I've been able to write so far:

interact_pd = pd.DataFrame(index=df_1.index)
df1_columns = [column for column in df_1]
for column in df_2:
    col_pd = df_1[df1_columns].multiply(df_2[column], axis="index")
    interact_pd = interact_pd.join(col_pd, lsuffix='_' + column)

I iterate over each column in df_2 and multiply all of df_1 by that column, then I append the result to interact_pd. I would rather not do it using a for loop however, as this is very computationally costly. Is there a faster way of doing it?

EDIT: example

df_1:

1col_1, 1col_2, 1col_3
 0      1      0 
 1      0      0 
 0      0      1 

df_2:

2col_1, 2col_2
 0      1       
 1      0       
 0      0      

interact_pd:

1col_1_2col_1, 1col_2_2col_1, 1col_3_2col_1, 1col_1_2col_2, 1col_2_2col_2, 1col_3_2col_2
  0      0      0        0       1        0
  1      0      0        0       0        0
  0      0      0        0       0        0

chris asked Aug 16 '16 05:08


3 Answers

# use numpy to get a pair of indices that map out every
# combination of columns from df_1 and columns of df_2
pidx = np.indices((df_1.shape[1], df_2.shape[1])).reshape(2, -1)

# use pandas MultiIndex to create a nice MultiIndex for
# the final output
lcol = pd.MultiIndex.from_product([df_1.columns, df_2.columns],
                                  names=[df_1.columns.name, df_2.columns.name])

# df_1.values[:, pidx[0]] slices df_1 values for every combination
# likewise with df_2.values[:, pidx[1]]
# finally, I marry up the product of arrays with the MultiIndex
pd.DataFrame(df_1.values[:, pidx[0]] * df_2.values[:, pidx[1]],
             columns=lcol)
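
To make the index trick concrete, here is a small sketch that runs the same steps on the question's 3x3 and 3x2 example frames. The final flattening of the MultiIndex into `1col_x_2col_y` style names is an assumption added for illustration; it is not part of the answer above.

```python
import numpy as np
import pandas as pd

# The example frames from the question.
df_1 = pd.DataFrame([[0, 1, 0], [1, 0, 0], [0, 0, 1]],
                    columns=["1col_1", "1col_2", "1col_3"])
df_2 = pd.DataFrame([[0, 1], [1, 0], [0, 0]],
                    columns=["2col_1", "2col_2"])

# Row 0 holds df_1 column positions, row 1 holds df_2 column positions:
# pidx[0] is [0, 0, 1, 1, 2, 2] and pidx[1] is [0, 1, 0, 1, 0, 1].
pidx = np.indices((df_1.shape[1], df_2.shape[1])).reshape(2, -1)

# One (df_1 column, df_2 column) label per pair, in the same order as pidx.
lcol = pd.MultiIndex.from_product([df_1.columns, df_2.columns])

# Fancy-index both value arrays so that column k of each side lines up
# with pair k, then multiply element-wise.
result = pd.DataFrame(df_1.values[:, pidx[0]] * df_2.values[:, pidx[1]],
                      columns=lcol, index=df_1.index)

# Optional (assumption): flatten the two levels into single string names.
flat = result.copy()
flat.columns = ["_".join(pair) for pair in result.columns]
print(flat)
```

Note that the columns come out grouped by df_1 column (1col_1_2col_1, 1col_1_2col_2, ...), whereas the expected output in the question groups them by df_2 column.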



Timing

code

import numpy as np
import pandas as pd
from string import ascii_letters

df_1 = pd.DataFrame(np.random.randint(0, 2, (1000, 26)), columns=list(ascii_letters[:26]))
df_2 = pd.DataFrame(np.random.randint(0, 2, (1000, 52)), columns=list(ascii_letters))

def pir1(df_1, df_2):
    pidx = np.indices((df_1.shape[1], df_2.shape[1])).reshape(2, -1)

    lcol = pd.MultiIndex.from_product([df_1.columns, df_2.columns],
                                      names=[df_1.columns.name, df_2.columns.name])

    return pd.DataFrame(df_1.values[:, pidx[0]] * df_2.values[:, pidx[1]],
                        columns=lcol)

def Test2(DA,DB):
  MA = DA.values  # as_matrix() was removed in pandas 1.0
  MB = DB.values
  MM = np.zeros((len(MA),len(MA[0])*len(MB[0])))
  Col = []
  for i in range(len(MB[0])):
    for j in range(len(MA[0])):
      MM[:,i*len(MA[0])+j] = MA[:,j]*MB[:,i]
      Col.append('1col_'+str(i+1)+'_2col_'+str(j+1))
  return pd.DataFrame(MM,dtype=int,columns=Col)

results

[Image: timing results for pir1 and Test2]

piRSquared answered Nov 14 '22 22:11


You can multiply your first df, along the index axis, by each column of the second df. This is the fastest method for big datasets (see below):

# items() replaces iteritems(), which was removed in pandas 2.0
df = pd.concat([df_1.mul(col[1], axis="index") for col in df_2.items()], axis=1)
# Change the name of the columns
df.columns = ["_".join([i, j]) for j in df_2.columns for i in df_1.columns]
df
       1col_1_2col_1  1col_2_2col_1  1col_3_2col_1  1col_1_2col_2  \
0                  0              0              0              0   
1                  1              0              0              0   
2                  0              0              0              0   

   1col_2_2col_2  1col_3_2col_2  
0              1              0  
1              0              0  
2              0              0  

--> See benchmark for comparisons with other answers to choose the best option for your dataset.


Benchmark

Functions:

def Test2(DA,DB):
  MA = DA.values  # as_matrix() was removed in pandas 1.0
  MB = DB.values
  MM = np.zeros((len(MA),len(MA[0])*len(MB[0])))
  Col = []
  for i in range(len(MB[0])):
    for j in range(len(MA[0])):
      MM[:,i*len(MA[0])+j] = MA[:,j]*MB[:,i]
      Col.append('1col_'+str(i+1)+'_2col_'+str(j+1))
  return pd.DataFrame(MM,dtype=int,columns=Col)

def Test3(df_1, df_2):
    # items() replaces iteritems(), which was removed in pandas 2.0
    df = pd.concat([df_1.mul(i[1], axis="index") for i in df_2.items()], axis=1)
    df.columns = ["_".join([i,j]) for j in df_2.columns for i in df_1.columns]
    return df

def Test4(df_1,df_2):
    pidx = np.indices((df_1.shape[1], df_2.shape[1])).reshape(2, -1)
    lcol = pd.MultiIndex.from_product([df_1.columns, df_2.columns],
                                      names=[df_1.columns.name, df_2.columns.name])
    return pd.DataFrame(df_1.values[:, pidx[0]] * df_2.values[:, pidx[1]],
                 columns=lcol)

def jeanrjc_imp(df_1, df_2):
    # keys= labels each block with its df_2 column, giving MultiIndex columns
    df = pd.concat([df_1.mul(i[1], axis="index") for i in df_2.items()], axis=1, keys=df_2.columns)
    return df
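
As a sketch of what the `keys=` variant returns, the example below applies the same concat approach to the question's small frames and then flattens the resulting MultiIndex into the names the question asks for. The names `pieces`, `interact` and `interact_flat` are made up for this illustration.

```python
import pandas as pd

# The example frames from the question.
df_1 = pd.DataFrame([[0, 1, 0], [1, 0, 0], [0, 0, 1]],
                    columns=["1col_1", "1col_2", "1col_3"])
df_2 = pd.DataFrame([[0, 1], [1, 0], [0, 0]],
                    columns=["2col_1", "2col_2"])

# DataFrame.items() yields (column name, column Series) pairs.
# Each piece is all of df_1 multiplied row-wise by one df_2 column.
pieces = [df_1.mul(col, axis="index") for _, col in df_2.items()]

# keys= adds the df_2 column name as an outer level of the column index.
interact = pd.concat(pieces, axis=1, keys=df_2.columns)

# Each label is a (df_2 name, df_1 name) tuple; join as "<df_1>_<df_2>".
interact_flat = interact.copy()
interact_flat.columns = [f"{c1}_{c2}" for c2, c1 in interact.columns]
print(interact_flat)
```

With this ordering the flattened columns match the interact_pd layout shown in the question: all df_1 columns for 2col_1 first, then all df_1 columns for 2col_2.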

Code:

Sorry for the ugly code; the plot at the end is what matters:

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

df_1 = pd.DataFrame(np.random.randint(0, 2, (1000, 600)))
df_2 = pd.DataFrame(np.random.randint(0, 2, (1000, 600)))
df_1.columns = ["1col_"+str(i) for i in range(len(df_1.columns))]
df_2.columns = ["2col_"+str(i) for i in range(len(df_2.columns))]
resa = {}
resb = {}
resc = {}
for f, r in zip([Test2, Test3, Test4, jeanrjc_imp], ["T2", "T3", "T4", "T3bis"]):
        resa[r] = []
        resb[r] = []
        resc[r] = []
        for i in [5, 10, 30, 50, 150, 200]:
             a = %timeit -o f(df_1.iloc[:,:i], df_2.iloc[:, :10])
             b = %timeit -o f(df_1.iloc[:,:i], df_2.iloc[:, :50])
             c = %timeit -o f(df_1.iloc[:,:i], df_2.iloc[:, :200])
             resa[r].append(a.best)
             resb[r].append(b.best)
             resc[r].append(c.best)

X = [5, 10, 30, 50, 150, 200]
fig, ax = plt.subplots(1, 3, figsize=[16,5])
for j, (a, r) in enumerate(zip(ax, [resa, resb, resc])):
    for i in r:
        a.plot(X, r[i], label=i)
        a.set_xlabel("df_1 columns #") 
        a.set_title("df_2 columns # = {}".format(["10", "50", "200"][j]))
ax[0].set_ylabel("time(s)")
plt.legend(loc=0)
plt.tight_layout()
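
The timing loop above relies on the IPython `%timeit -o` magic, so it only runs in IPython or Jupyter. For a plain Python script, a rough equivalent can be sketched with the standard-library `timeit` module. The helper `best_time` and the function `concat_mul` are hypothetical names for this illustration, and the frame sizes are kept small so the sketch runs quickly.

```python
import timeit

import numpy as np
import pandas as pd


def best_time(func, *args, repeat=3, number=5):
    """Return the best average time per call, in seconds."""
    # timeit.repeat returns one total time per run of `number` calls.
    timings = timeit.repeat(lambda: func(*args), repeat=repeat, number=number)
    return min(timings) / number


def concat_mul(df_1, df_2):
    # Same idea as jeanrjc_imp, written with DataFrame.items().
    return pd.concat([df_1.mul(col, axis="index") for _, col in df_2.items()],
                     axis=1, keys=df_2.columns)


rng = np.random.default_rng(0)
df_1 = pd.DataFrame(rng.integers(0, 2, (1000, 50)))
df_2 = pd.DataFrame(rng.integers(0, 2, (1000, 10)))

elapsed = best_time(concat_mul, df_1, df_2)
print(f"best of 3: {elapsed * 1e3:.2f} ms per call")
```

Absolute numbers depend on the machine and the pandas version, so they are only useful for comparing functions measured in the same session.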

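[Figure: Pandas column multiplication benchmark. Three panels, for df_2 columns # = 10, 50 and 200; x-axis: df_1 columns #, y-axis: time(s); one line each for T2, T3, T4 and T3bis.]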

T3bis corresponds to jeanrjc_imp, which is a bit faster than Test3.

Conclusion:

Depending on your dataset size, pick the right function, between Test4 and Test3(b). Given the OP's dataset, Test3 or jeanrjc_imp should be the fastest, and also the shortest to write!

HTH

jrjc answered Nov 14 '22 21:11


You can use numpy.

Consider this example code. I modified the variable names, but Test1() is essentially your code. I didn't bother to create the correct column names in that function, though:

import pandas as pd
import numpy as np

A = [[1,0,1,1],[0,1,1,0],[0,1,0,1]]
B = [[0,0,1,0],[1,0,1,0],[1,1,0,0],[1,0,0,1],[1,0,0,0]]

DA = pd.DataFrame(A).T
DB = pd.DataFrame(B).T

def Test1(DA,DB):
  E = pd.DataFrame(index=DA.index)
  DAC = [column for column in DA]
  for column in DB:
    C = DA[DAC].multiply(DB[column], axis="index")
    E = E.join(C, lsuffix='_' + str(column))
  return E

def Test2(DA,DB):
  MA = DA.values  # as_matrix() was removed in pandas 1.0
  MB = DB.values
  MM = np.zeros((len(MA),len(MA[0])*len(MB[0])))
  Col = []
  for i in range(len(MB[0])):
    for j in range(len(MA[0])):
      MM[:,i*len(MA[0])+j] = MA[:,j]*MB[:,i]
      Col.append('1col_'+str(i+1)+'_2col_'+str(j+1))
  return pd.DataFrame(MM,dtype=int,columns=Col)

print(Test1(DA,DB))
print(Test2(DA,DB))

Output:

   0_1  1_1  2_1  0  1  2  0_3  1_3  2_3  0  1  2  0  1  2
0    0    0    0  1  0  0    1    0    0  1  0  0  1  0  0
1    0    0    0  0  0  0    0    1    1  0  0  0  0  0  0
2    1    1    0  1  1  0    0    0    0  0  0  0  0  0  0
3    0    0    0  0  0  0    0    0    0  1  0  1  0  0  0
   1col_1_2col_1  1col_1_2col_2  1col_1_2col_3  1col_2_2col_1  1col_2_2col_2  \
0              0              0              0              1              0   
1              0              0              0              0              0   
2              1              1              0              1              1   
3              0              0              0              0              0   

   1col_2_2col_3  1col_3_2col_1  1col_3_2col_2  1col_3_2col_3  1col_4_2col_1  \
0              0              1              0              0              1   
1              0              0              1              1              0   
2              0              0              0              0              0   
3              0              0              0              0              1   

   1col_4_2col_2  1col_4_2col_3  1col_5_2col_1  1col_5_2col_2  1col_5_2col_3  
0              0              0              1              0              0  
1              0              0              0              0              0  
2              0              0              0              0              0  
3              0              1              0              0              0  

Performance of your function:

%timeit(Test1(DA,DB))
100 loops, best of 3: 11.1 ms per loop

Performance of my function:

%timeit(Test2(DA,DB))
1000 loops, best of 3: 464 µs per loop

It's not beautiful, but it's efficient.

Khris answered Nov 14 '22 23:11