I have a function f(x, y) that takes two pandas Series and returns a floating-point number. I would like to apply f to each pair of columns in a DataFrame D and construct another DataFrame E of the returned values, so that f(D[i], D[j]) is the value in the i-th row and j-th column. The straightforward solution is to run a nested loop over all pairs of columns:
E = pd.DataFrame([[f(D[i], D[j]) for i in D] for j in D],
                 columns=D.columns, index=D.columns)
But is there a more elegant solution that perhaps would not involve explicit loops?
NB This question is not a dupe of this, despite the similar names.
EDIT A toy example:
D = pd.DataFrame([[1,2,3], [4,5,6], [7,8,9]], columns=("a","b","c"))
def f(x,y): return x.dot(y)
E
# a b c
#a 66 78 90
#b 78 93 108
#c 90 108 126
You can avoid explicit loops by using NumPy's broadcasting. Combined with np.vectorize() and an explicit signature, that gives us the following:
vf = np.vectorize(f, signature='(n),(n)->()')
result = vf(D.T.values, D.T.values[:, None])
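As a minimal, self-contained sketch of the two lines above, the following runs them on the toy data from the question and compares the output with the loop-built E. The name cols is only an illustrative shorthand for D.T.values, and wrapping the result in a DataFrame is an added convenience:

```python
import numpy as np
import pandas as pd

D = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=("a", "b", "c"))

def f(x, y):
    # Inside np.vectorize, x and y arrive as 1-D NumPy arrays, not Series.
    return x.dot(y)

# Reference result from the nested loop in the question.
E = pd.DataFrame([[f(D[i], D[j]) for i in D] for j in D],
                 columns=D.columns, index=D.columns)

# The signature says: two vectors of equal length n in, one scalar out.
vf = np.vectorize(f, signature='(n),(n)->()')
cols = D.T.values                  # shape (3, 3): one row per column of D
result = vf(cols, cols[:, None])   # (3, n) against (3, 1, n) -> (3, 3)

df = pd.DataFrame(result, index=D.columns, columns=D.columns)
print(df)
```

Note that np.vectorize still calls f once per pair in Python; the signature only takes care of the looping and broadcasting for you.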
Notes:
- You can add some debugging output (e.g. print(f'x:\n{x}\ny:\n{y}\n')) in your function, to convince yourself it is doing the right thing.
- The f() in your toy example is symmetric; if it is not (e.g. def f(x, y): return np.linalg.norm(x - y**2)), it matters which argument is extended with an extra dimension for broadcasting. With the expression above, you'll get the same result as your E. If instead you use result = vf(D.T.values[:, None], D.T.values), then you'll get its transpose.
- The result is a NumPy array; to get a DataFrame back: df = pd.DataFrame(result, index=D.columns, columns=D.columns)
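To make the point about argument order concrete, here is a small sketch using the asymmetric function mentioned in the notes. The names g, vg, cols, same and flipped are hypothetical and chosen only for this illustration:

```python
import numpy as np
import pandas as pd

D = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=("a", "b", "c"))

def g(x, y):
    # Asymmetric on purpose: g(x, y) differs from g(y, x) in general.
    return np.linalg.norm(x - y**2)

# Loop-built reference, same layout as E in the question.
E = pd.DataFrame([[g(D[i], D[j]) for i in D] for j in D],
                 columns=D.columns, index=D.columns)

vg = np.vectorize(g, signature='(n),(n)->()')
cols = D.T.values

same = vg(cols, cols[:, None])      # extra axis on the second argument
flipped = vg(cols[:, None], cols)   # extra axis on the first argument

print(np.allclose(same, E.values))        # compares with the loop-built E
print(np.allclose(flipped, E.values.T))   # compares with its transpose
```

Both checks should come out true, while same and flipped differ from each other because g is not symmetric.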
BTW, if f() is really the one from your toy example, as I'm sure you already know, you can directly write:
df = D.T.dot(D)
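A quick sanity check of that shortcut on the toy data, assuming f is the plain dot product from the question:

```python
import pandas as pd

D = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=("a", "b", "c"))

# Loop-built reference using the toy f(x, y) = x.dot(y).
E = pd.DataFrame([[D[i].dot(D[j]) for i in D] for j in D],
                 columns=D.columns, index=D.columns)

# D.T has the columns of D as rows, so D.T.dot(D) holds every pairwise dot product.
df = D.T.dot(D)
print((df.values == E.values).all())
```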
Performance:
Performance-wise, the speed-up from broadcasting and np.vectorize() over the nested loops is roughly 10x (stable over various matrix sizes). By contrast, D.T.dot(D) is more than 700x faster for size (100, 100), and, critically, the relative speed-up seems to grow with larger sizes (up to 12,000x faster in my tests, for size (200, 1000), resulting in 1M loops). So, as usual, there is a strong incentive to try and find a way to implement your function f() using existing NumPy function(s)!
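If you want to reproduce this kind of comparison yourself, a rough timing harness could look like the sketch below. The helper names (with_loops, with_vectorize, with_dot), the random data and the small (50, 20) size are assumptions made for illustration, not the setup behind the numbers above, and timings will vary by machine:

```python
import timeit

import numpy as np
import pandas as pd

def f(x, y):
    return x.dot(y)

def with_loops(D):
    # Nested Python loops over all pairs of columns.
    return pd.DataFrame([[f(D[i], D[j]) for i in D] for j in D],
                        columns=D.columns, index=D.columns)

def with_vectorize(D):
    # np.vectorize with a signature, plus broadcasting over column pairs.
    vf = np.vectorize(f, signature='(n),(n)->()')
    cols = D.T.values
    return pd.DataFrame(vf(cols, cols[:, None]),
                        index=D.columns, columns=D.columns)

def with_dot(D):
    # A single matrix product; only valid because f is a dot product here.
    return D.T.dot(D)

rng = np.random.default_rng(0)
D = pd.DataFrame(rng.random((50, 20)))  # hypothetical size: 50 rows, 20 columns

for fn in (with_loops, with_vectorize, with_dot):
    seconds = timeit.timeit(lambda: fn(D), number=3) / 3
    print(f"{fn.__name__}: {seconds:.6f} s per call")
```

Increase the shape passed to rng.random() to see how the gap changes with size.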