Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

pandas outer product of two dataframes with same index

Consider the following dataframes d1 and d1

d1 = pd.DataFrame([
    [1, 2, 3],
    [2, 3, 4],
    [3, 4, 5],
    [1, 2, 3],
    [2, 3, 4],
    [3, 4, 5]
], columns=list('ABC'))

d2 = pd.get_dummies(list('XYZZXY'))

d1

   A  B  C
0  1  2  3
1  2  3  4
2  3  4  5
3  1  2  3
4  2  3  4
5  3  4  5

d2

   X  Y  Z
0  1  0  0
1  0  1  0
2  0  0  1
3  0  0  1
4  1  0  0
5  0  1  0

I need to get a new dataframe with a multi-index columns object that has the product of every combination of columns from d1 and d2


So far I've done this...

from itertools import product
pd.concat({(x, y): d1[x] * d2[y] for x, y in product(d1, d2)}, axis=1)

   A        B        C      
   X  Y  Z  X  Y  Z  X  Y  Z
0  1  0  0  2  0  0  3  0  0
1  0  2  0  0  3  0  0  4  0
2  0  0  3  0  0  4  0  0  5
3  0  0  1  0  0  2  0  0  3
4  2  0  0  3  0  0  4  0  0
5  0  3  0  0  4  0  0  5  0

There is nothing wrong with this method. But I'm looking for alternatives to evaluate.


Inspired by Yakym Pirozhenko

m, n = len(d1.columns), len(d2.columns)
lvl0 = np.repeat(np.arange(m), n)
lvl1 = np.tile(np.arange(n), m)
v1, v2 = d1.values, d2.values

pd.DataFrame(
    v1[:, lvl0] * v2[:, lvl1],
    d1.index,
    pd.MultiIndex.from_tuples(list(zip(d1.columns[lvl0], d2.columns[lvl1])))
)

However, this is a more clumsy implementation of numpy broadcasting which is better covered by Divakar.

Timing
All answers were good answers and demonstrate different aspects of pandas and numpy. Please consider up-voting them if you found them useful and informative.

%%timeit
m, n = len(d1.columns), len(d2.columns)
lvl0 = np.repeat(np.arange(m), n)
lvl1 = np.tile(np.arange(n), m)
v1, v2 = d1.values, d2.values

pd.DataFrame(
    v1[:, lvl0] * v2[:, lvl1],
    d1.index,
    pd.MultiIndex.from_tuples(list(zip(d1.columns[lvl0], d2.columns[lvl1])))
)

%%timeit 
vals = (d2.values[:,None,:] * d1.values[:,:,None]).reshape(d1.shape[0],-1)
cols = pd.MultiIndex.from_product([d1.columns, d2.columns])
pd.DataFrame(vals, columns=cols, index=d1.index)

%timeit d1.apply(lambda x: d2.mul(x, axis=0).stack()).unstack()
%timeit pd.concat({x : d2.mul(d1[x], axis=0) for x in d1.columns}, axis=1)
%timeit pd.concat({(x, y): d1[x] * d2[y] for x, y in product(d1, d2)}, axis=1)

1000 loops, best of 3: 663 µs per loop
1000 loops, best of 3: 624 µs per loop
100 loops, best of 3: 3.38 ms per loop
1000 loops, best of 3: 860 µs per loop
100 loops, best of 3: 2.01 ms per loop
like image 851
piRSquared Avatar asked Dec 19 '22 05:12

piRSquared


1 Answers

Here is a one-liner that uses pandas stack and unstack method. The "trick" is to use stack, so that the result of each computation within apply is a time series. Then use unstack to obtain the Multiindex form.

d1.apply(lambda x: d2.mul(x, axis=0).stack()).unstack()

Which gives:

     A              B              C          
     X    Y    Z    X    Y    Z    X    Y    Z
0  1.0  0.0  0.0  2.0  0.0  0.0  3.0  0.0  0.0
1  0.0  2.0  0.0  0.0  3.0  0.0  0.0  4.0  0.0
2  0.0  0.0  3.0  0.0  0.0  4.0  0.0  0.0  5.0
3  0.0  0.0  1.0  0.0  0.0  2.0  0.0  0.0  3.0
4  2.0  0.0  0.0  3.0  0.0  0.0  4.0  0.0  0.0
5  0.0  3.0  0.0  0.0  4.0  0.0  0.0  5.0  0.0
like image 157
FLab Avatar answered Jan 05 '23 18:01

FLab