I have a dataframe where some columns are name-paired (for each column ending with _x there is a corresponding column ending with _y) and others are not. For example:
import pandas as pd
import numpy as np
colnames = [
'foo', 'bar', 'baz',
'a_x', 'b_x', 'c_x',
'a_y', 'b_y', 'c_y',
]
rng = np.random.default_rng(0)
data = rng.random((20, len(colnames)))
df = pd.DataFrame(data, columns=colnames)
Assume I have two lists, both of length m, containing all the column names ending with _x and all those ending with _y (such lists are easy to build; remember that for each _x column there is exactly one corresponding _y column). I want to create m new columns with a simple formula:
df['a_err'] = (df['a_x'] - df['a_y']) / df['a_y']
without hard-coding the column names, of course. It's easy to do so with a for loop, but I would like to know if it's possible to do the same without a loop, in the hope that it would be faster (the real dataframe is way bigger than this small example).
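For reference, the straightforward loop baseline the question mentions could look like this (the list-building and helper names are illustrative, not from the question):

```python
import numpy as np
import pandas as pd

colnames = ['foo', 'bar', 'baz', 'a_x', 'b_x', 'c_x', 'a_y', 'b_y', 'c_y']
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((20, len(colnames))), columns=colnames)

# the two paired-column lists (easy to build, as noted above)
x_cols = [c for c in df.columns if c.endswith('_x')]
y_cols = [c.removesuffix('_x') + '_y' for c in x_cols]

# one new column per (_x, _y) pair
for x, y in zip(x_cols, y_cols):
    df[x.removesuffix('_x') + '_err'] = (df[x] - df[y]) / df[y]
```

The answers below aim to replace this per-pair loop with vectorized operations.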
You can use groupby + apply with a custom function:
# Each group is a two-column sub-dataframe, e.g. (a_x, a_y)
func = lambda g: (g.iloc[:, 0] - g.iloc[:, 1]) / g.iloc[:, 1]
# r'' stands for raw strings, like f'' for formatted strings
# filter: keep columns that end ($) with '_x' or '_y' (_[xy])
# groupby: group by column prefix (a_x -> a, b_x -> b, ..., c_y -> c)
# apply: run the formula on each group (a_x, a_y), (b_x, b_y), (c_x, c_y)
err = (df.filter(regex=r'_[xy]$')
         .groupby(lambda x: x.split('_')[0], axis=1)
         .apply(func)
         .add_suffix('_err'))
# Append the error columns to your original dataframe
df = pd.concat([df, err], axis=1)
Output:
>>> df
foo bar baz a_x b_x c_x a_y b_y c_y a_err b_err c_err
0 0.636962 0.269787 0.040974 0.016528 0.813270 0.912756 0.606636 0.729497 0.543625 -0.972755 0.114838 0.679017
1 0.935072 0.815854 0.002739 0.857404 0.033586 0.729655 0.175656 0.863179 0.541461 3.881166 -0.961091 0.347567
2 0.299712 0.422687 0.028320 0.124283 0.670624 0.647190 0.615385 0.383678 0.997210 -0.798040 0.747885 -0.351000
3 0.980835 0.685542 0.650459 0.688447 0.388921 0.135097 0.721488 0.525354 0.310242 -0.045796 -0.259697 -0.564545
4 0.485835 0.889488 0.934044 0.357795 0.571530 0.321869 0.594300 0.337911 0.391619 -0.397955 0.691361 -0.178106
5 0.890274 0.227158 0.623187 0.084015 0.832644 0.787098 0.239369 0.876484 0.058568 -0.649014 -0.050018 12.439042
6 0.336117 0.150279 0.450339 0.796324 0.230642 0.052021 0.404552 0.198513 0.090753 0.968411 0.161849 -0.426782
7 0.580332 0.298696 0.671995 0.199515 0.942113 0.365110 0.105495 0.629108 0.927155 0.891226 0.497538 -0.606204
8 0.440377 0.954590 0.499896 0.425229 0.620213 0.995097 0.948944 0.460045 0.757729 -0.551893 0.348158 0.313262
9 0.497423 0.529312 0.785786 0.414656 0.734484 0.711143 0.932060 0.114933 0.729015 -0.555119 5.390557 -0.024516
10 0.927424 0.967926 0.014706 0.863640 0.981195 0.957210 0.148764 0.972629 0.889936 4.805437 0.008807 0.075595
11 0.822374 0.479988 0.232373 0.801881 0.923530 0.266130 0.538934 0.442753 0.931017 0.487900 1.085882 -0.714151
12 0.040511 0.732006 0.614373 0.028365 0.719220 0.015992 0.757951 0.512759 0.929104 -0.962576 0.402648 -0.982788
13 0.066082 0.841317 0.066690 0.344310 0.430299 0.966062 0.562232 0.258865 0.241676 -0.387601 0.662254 2.997349
14 0.888118 0.225869 0.124555 0.288331 0.586123 0.554091 0.809711 0.560476 0.288421 -0.643909 0.045760 0.921116
15 0.412896 0.818121 0.626506 0.959078 0.369404 0.552612 0.593924 0.848291 0.145474 0.614815 -0.564531 2.798708
16 0.406510 0.909959 0.043067 0.822706 0.415384 0.829804 0.009955 0.365046 0.078630 81.646166 0.137895 9.553270
17 0.652615 0.273849 0.702652 0.943801 0.126817 0.864778 0.059464 0.380771 0.429774 14.871772 -0.666946 1.012170
18 0.488850 0.976462 0.775691 0.308857 0.269837 0.863120 0.881307 0.510707 0.344296 -0.649546 -0.471640 1.506915
19 0.994917 0.315944 0.182712 0.880098 0.812335 0.667889 0.958414 0.925715 0.748249 -0.081714 -0.122477 -0.107396
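Note that column-wise grouping with `groupby(..., axis=1)` is deprecated in recent pandas (2.1+). One way to keep the same groupby idea without it is to transpose, group the rows by prefix, and transpose back (a sketch, not part of the original answer):

```python
import numpy as np
import pandas as pd

colnames = ['foo', 'bar', 'baz', 'a_x', 'b_x', 'c_x', 'a_y', 'b_y', 'c_y']
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((20, len(colnames))), columns=colnames)

# Transpose so the paired columns become rows, group rows by prefix,
# then apply the formula: row 0 of each group is the _x series, row 1 the _y series
err = (df.filter(regex=r'_[xy]$')
         .T
         .groupby(lambda name: name.split('_')[0])
         .apply(lambda g: (g.iloc[0] - g.iloc[1]) / g.iloc[1])
         .T
         .add_suffix('_err'))
df = pd.concat([df, err], axis=1)
```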
You can also use filter to split your columns:
# Python >= 3.8 (walrus operator)
# .values drops the _x labels, so the subtraction aligns by position
err = (df.filter(regex='_x$').values - (y := df.filter(regex='_y$'))) / y
err.columns = err.columns.str.split('_').str[0] + '_err'
df = pd.concat([df, err], axis=1)
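If raw speed is the goal, you can also drop to NumPy entirely and rebuild the labels afterwards, a sketch assuming the _x and _y columns appear in matching order (which `filter` preserves here):

```python
import numpy as np
import pandas as pd

colnames = ['foo', 'bar', 'baz', 'a_x', 'b_x', 'c_x', 'a_y', 'b_y', 'c_y']
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((20, len(colnames))), columns=colnames)

x = df.filter(regex='_x$')
y = df.filter(regex='_y$')
# One vectorized NumPy operation over all m pairs at once
err = pd.DataFrame((x.to_numpy() - y.to_numpy()) / y.to_numpy(),
                   index=df.index,
                   columns=x.columns.str.replace('_x', '_err'))
df = pd.concat([df, err], axis=1)
```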
Since the _x and _y columns are paired, you can limit any iteration to just the columns and still be performant. The solution below uses a MultiIndex and passes the bulk of the processing to Pandas:
# Keep only the paired columns (the ones containing '_')
temp = df.filter(like='_')
# Split the names into a two-level MultiIndex: a_x -> (a, x)
temp.columns = temp.columns.str.split('_', expand=True)
# Put the x/y level on top so temp.x and temp.y select the pairs
temp = temp.swaplevel(axis='columns')
temp = temp.x.sub(temp.y).div(temp.y).add_suffix('_err')
# assign returns a new dataframe with the error columns appended
df.assign(**temp)
foo bar baz a_x b_x c_x a_y b_y c_y a_err b_err c_err
0 0.636962 0.269787 0.040974 0.016528 0.813270 0.912756 0.606636 0.729497 0.543625 -0.972755 0.114838 0.679017
1 0.935072 0.815854 0.002739 0.857404 0.033586 0.729655 0.175656 0.863179 0.541461 3.881166 -0.961091 0.347567
2 0.299712 0.422687 0.028320 0.124283 0.670624 0.647190 0.615385 0.383678 0.997210 -0.798040 0.747885 -0.351000
3 0.980835 0.685542 0.650459 0.688447 0.388921 0.135097 0.721488 0.525354 0.310242 -0.045796 -0.259697 -0.564545
4 0.485835 0.889488 0.934044 0.357795 0.571530 0.321869 0.594300 0.337911 0.391619 -0.397955 0.691361 -0.178106
5 0.890274 0.227158 0.623187 0.084015 0.832644 0.787098 0.239369 0.876484 0.058568 -0.649014 -0.050018 12.439042
6 0.336117 0.150279 0.450339 0.796324 0.230642 0.052021 0.404552 0.198513 0.090753 0.968411 0.161849 -0.426782
7 0.580332 0.298696 0.671995 0.199515 0.942113 0.365110 0.105495 0.629108 0.927155 0.891226 0.497538 -0.606204
8 0.440377 0.954590 0.499896 0.425229 0.620213 0.995097 0.948944 0.460045 0.757729 -0.551893 0.348158 0.313262
9 0.497423 0.529312 0.785786 0.414656 0.734484 0.711143 0.932060 0.114933 0.729015 -0.555119 5.390557 -0.024516
10 0.927424 0.967926 0.014706 0.863640 0.981195 0.957210 0.148764 0.972629 0.889936 4.805437 0.008807 0.075595
11 0.822374 0.479988 0.232373 0.801881 0.923530 0.266130 0.538934 0.442753 0.931017 0.487900 1.085882 -0.714151
12 0.040511 0.732006 0.614373 0.028365 0.719220 0.015992 0.757951 0.512759 0.929104 -0.962576 0.402648 -0.982788
13 0.066082 0.841317 0.066690 0.344310 0.430299 0.966062 0.562232 0.258865 0.241676 -0.387601 0.662254 2.997349
14 0.888118 0.225869 0.124555 0.288331 0.586123 0.554091 0.809711 0.560476 0.288421 -0.643909 0.045760 0.921116
15 0.412896 0.818121 0.626506 0.959078 0.369404 0.552612 0.593924 0.848291 0.145474 0.614815 -0.564531 2.798708
16 0.406510 0.909959 0.043067 0.822706 0.415384 0.829804 0.009955 0.365046 0.078630 81.646166 0.137895 9.553270
17 0.652615 0.273849 0.702652 0.943801 0.126817 0.864778 0.059464 0.380771 0.429774 14.871772 -0.666946 1.012170
18 0.488850 0.976462 0.775691 0.308857 0.269837 0.863120 0.881307 0.510707 0.344296 -0.649546 -0.471640 1.506915
19 0.994917 0.315944 0.182712 0.880098 0.812335 0.667889 0.958414 0.925715 0.748249 -0.081714 -0.122477 -0.107396
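The question is ultimately about speed, so it is worth measuring on data closer to the real size. A quick timing sketch comparing the per-pair loop with a single vectorized NumPy operation (the sizes and column names are arbitrary assumptions, not from the question):

```python
import time
import numpy as np
import pandas as pd

# Arbitrary sizes; tune to match the real dataframe
rng = np.random.default_rng(0)
m, n = 100, 20_000
x_cols = [f'c{i}_x' for i in range(m)]
y_cols = [f'c{i}_y' for i in range(m)]
df = pd.DataFrame(rng.random((n, 2 * m)), columns=x_cols + y_cols)

# Baseline: one pandas operation per column pair
t0 = time.perf_counter()
loop = pd.DataFrame({f'c{i}_err': (df[f'c{i}_x'] - df[f'c{i}_y']) / df[f'c{i}_y']
                     for i in range(m)})
t_loop = time.perf_counter() - t0

# Vectorized: one NumPy operation over all pairs at once
t0 = time.perf_counter()
vec = (df[x_cols].to_numpy() - df[y_cols].to_numpy()) / df[y_cols].to_numpy()
t_vec = time.perf_counter() - t0

print(f'loop: {t_loop:.3f}s, vectorized: {t_vec:.3f}s')
```

The two approaches produce the same numbers; how much the vectorized version wins by depends on m, n, and your pandas/NumPy versions, so measure on your own data.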