Pandas: Apply function to each pair of columns

Tags:

python

pandas

I have a function f(x,y) that takes two pandas Series and returns a floating point number. I would like to apply f to each pair of columns in a DataFrame D and construct another DataFrame E of the returned values, so that f(D[i],D[j]) is the value of the ith row and jth column. The straightforward solution is to run a nested loop over all pairs of columns:

E = pd.DataFrame([[f(D[i], D[j]) for i in D] for j in D],
                 columns=D.columns, index=D.columns)

But is there a more elegant solution that perhaps would not involve explicit loops?

NB This question is not a dupe of this, despite the similar names.

EDIT A toy example:

import pandas as pd

D = pd.DataFrame([[1,2,3], [4,5,6], [7,8,9]], columns=("a","b","c"))
def f(x,y): return x.dot(y)

E
#    a    b    c
#a  66   78   90
#b  78   93  108
#c  90  108  126

884

asked Sep 20 '17 05:09

DYZ

1 Answer

You can avoid explicit loops by using NumPy's broadcasting.

Combined with np.vectorize() and an explicit signature, that gives us the following:

import numpy as np

vf = np.vectorize(f, signature='(n),(n)->()')
result = vf(D.T.values, D.T.values[:, None])
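
As a quick sanity check, the sketch below runs this approach on the toy data from the question and compares it with the nested loop. The name cols is only an illustrative helper, and note that with a signature, np.vectorize passes plain NumPy arrays (not Series) to f, which is fine here because arrays also have a dot method.

```python
import numpy as np
import pandas as pd

D = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=("a", "b", "c"))

def f(x, y):
    return x.dot(y)

# Baseline: the nested loop from the question.
E = pd.DataFrame([[f(D[i], D[j]) for i in D] for j in D],
                 columns=D.columns, index=D.columns)

# Broadcasting version: each row of `cols` is one column of D.
cols = D.T.values                      # shape (n_columns, n_rows) = (3, 3)
vf = np.vectorize(f, signature='(n),(n)->()')
result = vf(cols, cols[:, None])       # loop shapes (3,) and (3, 1) -> (3, 3)

# True if both approaches agree element by element.
print(np.array_equal(result, E.values))
```

The extra axis added by cols[:, None] is what makes NumPy pair every column with every other column instead of matching them up one to one.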

Notes:

1. You can add a print statement (e.g. print(f'x:\n{x}\ny:\n{y}\n')) to your function to convince yourself it is doing the right thing.
2. Your function f() is symmetric. If it is not (e.g. def f(x, y): return np.linalg.norm(x - y**2)), it matters which argument gets the extra dimension for broadcasting. With the expression above, you'll get the same result as your E. If you instead use result = vf(D.T.values[:, None], D.T.values), you'll get its transpose.
3. The result is a NumPy array, of course; if you want it back as a DataFrame, add:

df = pd.DataFrame(result, index=D.columns, columns=D.columns)
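
To see the orientation point from the notes in action, here is a small sketch using the asymmetric example function. The names g, same and flipped are made up for illustration; the check assumes the nested loop from the question as the reference.

```python
import numpy as np
import pandas as pd

D = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=("a", "b", "c"))

def g(x, y):
    # Deliberately asymmetric: g(x, y) != g(y, x) in general.
    return np.linalg.norm(x - y**2)

# Nested loop from the question: row j, column i holds g(D[i], D[j]).
E = pd.DataFrame([[g(D[i], D[j]) for i in D] for j in D],
                 columns=D.columns, index=D.columns)

cols = D.T.values
vg = np.vectorize(g, signature='(n),(n)->()')

same = vg(cols, cols[:, None])      # extra axis on the second argument
flipped = vg(cols[:, None], cols)   # extra axis on the first argument

print(np.allclose(same, E.values))       # matches the nested loop
print(np.allclose(flipped, E.values.T))  # matches its transpose

# Wrap the array back into a labelled DataFrame.
same_df = pd.DataFrame(same, index=D.columns, columns=D.columns)
```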

BTW, if f() is really the one from your toy example, as I'm sure you already know, you can directly write:

df = D.T.dot(D)
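
A similar shortcut is sometimes available for other functions. As a hedged sketch, the asymmetric example from the notes above can be written with plain broadcasting, assuming the intermediate array of shape (n_columns, n_columns, n_rows) fits in memory. The names cols, diff and pairwise are illustrative only.

```python
import numpy as np
import pandas as pd

D = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=("a", "b", "c"))

cols = D.T.to_numpy()                       # shape (n_columns, n_rows)

# diff[j, i, :] = D[i] - D[j]**2, built in one broadcast operation.
diff = cols[None, :, :] - cols[:, None, :] ** 2

# Reduce over the last axis (the rows of D) to get one number per pair.
pairwise = np.linalg.norm(diff, axis=-1)

out = pd.DataFrame(pairwise, index=D.columns, columns=D.columns)
print(out)
```

This keeps the same orientation as the nested loop in the question (row j, column i), at the cost of materialising all pairwise differences at once.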

Performance:

Performance-wise, the speed-up from broadcasting with np.vectorize is roughly 10x over the nested loop (stable across matrix sizes). By contrast, D.T.dot(D) is more than 700x faster for size (100, 100), and critically the relative speed-up seems to grow with size (up to 12,000x faster in my tests, for size (200, 1000), i.e. 1M loops). So, as usual, there is a strong incentive to find a way to implement your function f() using existing NumPy functions!
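
If you want to reproduce this kind of comparison yourself, a minimal timing sketch could look like the following. The data size, the random seed and the helper names (nested_loop, vectorized, matrix_product) are arbitrary choices made for illustration, and the numbers you get will depend on your machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
D = pd.DataFrame(rng.normal(size=(100, 30)))   # hypothetical size, kept small

def f(x, y):
    return x.dot(y)

def nested_loop():
    return pd.DataFrame([[f(D[i], D[j]) for i in D] for j in D],
                        columns=D.columns, index=D.columns)

def vectorized():
    cols = D.T.values
    vf = np.vectorize(f, signature='(n),(n)->()')
    return pd.DataFrame(vf(cols, cols[:, None]),
                        index=D.columns, columns=D.columns)

def matrix_product():
    return D.T.dot(D)

# Average wall-clock time per call over a few repetitions.
for fn in (nested_loop, vectorized, matrix_product):
    seconds = timeit.timeit(fn, number=3) / 3
    print(f"{fn.__name__}: {seconds:.6f} s per call")
```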

175

answered Oct 29 '22 01:10

Pierre D
