
Pandas: Apply function to each pair of columns

Tags:

python

pandas

I have a function f(x, y) that takes two pandas Series and returns a floating-point number. I would like to apply f to each pair of columns in a DataFrame D and construct another DataFrame E from the returned values, so that f(D[i], D[j]) is the value in the ith row and jth column. The straightforward solution is to run a nested loop over all pairs of columns:

E = pd.DataFrame([[f(D[i], D[j]) for i in D] for j in D],
                 columns=D.columns, index=D.columns)

But is there a more elegant solution that perhaps would not involve explicit loops?

NB This question is not a dupe of this, despite the similar names.

EDIT A toy example:

import pandas as pd
D = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=("a", "b", "c"))
def f(x, y): return x.dot(y)

E
#    a    b    c
#a  66   78   90
#b  78   93  108
#c  90  108  126
asked Sep 20 '17 05:09 by DYZ


1 Answer

You can avoid writing explicit loops by using NumPy's broadcasting.

Combined with np.vectorize() and an explicit signature, that gives us the following:

import numpy as np
vf = np.vectorize(f, signature='(n),(n)->()')
result = vf(D.T.values, D.T.values[:, None])

Notes:

  1. You can add a print statement (e.g. print(f'x:\n{x}\ny:\n{y}\n')) to your function to convince yourself it is doing the right thing.
  2. Your function f() is symmetric. If it is not (e.g. def f(x, y): return np.linalg.norm(x - y**2)), it matters which argument is extended with an extra dimension for broadcasting. With the expression above, you'll get the same result as your E. If you instead use result = vf(D.T.values[:, None], D.T.values), you'll get its transpose.
  3. The result is a NumPy array, of course. If you want it back as a DataFrame, add:
df = pd.DataFrame(result, index=D.columns, columns=D.columns)
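
To make note 2 concrete, here is a minimal, self-contained sketch that puts the pieces above together on the toy DataFrame from the question. It swaps in the non-symmetric f from note 2 so that the effect of the argument order is visible. The names cols and flipped are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

D = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=("a", "b", "c"))

def f(x, y):
    # Deliberately non-symmetric, so swapping the arguments changes the result.
    return np.linalg.norm(x - y**2)

# Reference: the nested loop from the question.
E = pd.DataFrame([[f(D[i], D[j]) for i in D] for j in D],
                 columns=D.columns, index=D.columns)

# The signature says: f takes two 1-D vectors of equal length, returns a scalar.
vf = np.vectorize(f, signature='(n),(n)->()')

cols = D.T.values                   # shape (3, 3): one row per column of D
result = vf(cols, cols[:, None])    # (3, 3) against (3, 1, 3) -> (3, 3)
flipped = vf(cols[:, None], cols)   # extra axis on the other argument

df = pd.DataFrame(result, index=D.columns, columns=D.columns)
print(np.allclose(df.values, E.values))     # compares with the loop version
print(np.allclose(flipped, E.values.T))     # compares with its transpose
```

Note that inside vf the function receives plain NumPy arrays rather than pandas Series, so an f that relies on Series-only methods or on index labels would need adapting.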

BTW, if f() is really the one from your toy example, as I'm sure you already know, you can directly write:

df = D.T.dot(D)
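
As a quick sanity check, the following sketch compares the matrix product with the nested loop from the question on the toy data. The name gram is introduced here for illustration only.

```python
import pandas as pd

D = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=("a", "b", "c"))

def f(x, y):
    return x.dot(y)

# Nested loop from the question.
E = pd.DataFrame([[f(D[i], D[j]) for i in D] for j in D],
                 columns=D.columns, index=D.columns)

# All pairwise column dot products in a single matrix product.
gram = D.T.dot(D)

print((gram.values == E.values).all())
```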

Performance:

Performance-wise, the speed-up from broadcasting and vectorize over the nested loop is roughly 10x (stable over various matrix sizes). By contrast, D.T.dot(D) is more than 700x faster for size (100, 100), and, critically, the relative speed-up seems to grow with larger sizes (up to 12,000x faster in my tests, for size (200, 1000), which means 1M pairs of columns). So, as usual, there is a strong incentive to try to implement your function f() using existing NumPy functions!
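
If you want to measure this on your own machine, a rough timing harness could look like the sketch below. The data size, the random input, and the repeat count are assumptions chosen to keep the run short, not the settings used for the figures above, so the ratios you see may differ.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
D = pd.DataFrame(rng.random((50, 20)))   # assumed size: 50 rows, 20 columns

def f(x, y):
    return x.dot(y)

vf = np.vectorize(f, signature='(n),(n)->()')

def nested_loop():
    return pd.DataFrame([[f(D[i], D[j]) for i in D] for j in D],
                        columns=D.columns, index=D.columns)

def broadcast():
    cols = D.T.values
    return pd.DataFrame(vf(cols, cols[:, None]),
                        index=D.columns, columns=D.columns)

def matrix_product():
    return D.T.dot(D)

for fn in (nested_loop, broadcast, matrix_product):
    seconds = timeit.timeit(fn, number=3) / 3   # average over 3 calls
    print(f"{fn.__name__}: {seconds:.6f} s per call")
```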

answered Oct 29 '22 01:10 by Pierre D