I have a dataframe where some columns are name-paired (for each column ending with _x there is a corresponding column ending with _y) and others are not. For example:
import pandas as pd
import numpy as np
colnames = [
'foo', 'bar', 'baz',
'a_x', 'b_x', 'c_x',
'a_y', 'b_y', 'c_y',
]
rng = np.random.default_rng(0)
data = rng.random((20, len(colnames)))
df = pd.DataFrame(data, columns=colnames)
Assume I have two lists, both of length m, containing all the column names ending with _x and all those ending with _y (such lists are easy to build; remember that for each _x column there is exactly one corresponding _y column). I want to create m new columns with a simple formula:
df['a_err'] = (df['a_x'] - df['a_y']) / df['a_y']
without hard-coding the column names, of course. It's easy to do so with a for loop, but I would like to know if it's possible to do the same without a loop, in the hope that it would be faster (the real dataframe is way bigger than this small example).
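For reference, the straightforward loop baseline the question mentions could look like this (the list-building and helper names are illustrative, not from the question):

```python
import numpy as np
import pandas as pd

colnames = ['foo', 'bar', 'baz', 'a_x', 'b_x', 'c_x', 'a_y', 'b_y', 'c_y']
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((20, len(colnames))), columns=colnames)

# the two paired-column lists (easy to build, as noted above)
x_cols = [c for c in df.columns if c.endswith('_x')]
y_cols = [c.removesuffix('_x') + '_y' for c in x_cols]

# one new column per (_x, _y) pair
for x, y in zip(x_cols, y_cols):
    df[x.removesuffix('_x') + '_err'] = (df[x] - df[y]) / df[y]
```

The answers below aim to replace this per-pair loop with vectorized operations.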
You can use groupby + apply with a custom function:
# Each group is a two-column sub-dataframe, e.g. (a_x, a_y)
func = lambda g: (g.iloc[:, 0] - g.iloc[:, 1]) / g.iloc[:, 1]
# r'' stands for raw strings, like f'' for formatted strings
# filter: keep columns that end ($) with '_x' or '_y' (_[xy])
# groupby: group by column prefix (a_x -> a, b_x -> b, ..., c_y -> c)
# apply: run the formula on each group (a_x, a_y), (b_x, b_y), (c_x, c_y)
err = (df.filter(regex=r'_[xy]$')
         .groupby(lambda x: x.split('_')[0], axis=1)
         .apply(func)
         .add_suffix('_err'))
# Append the error columns to your original dataframe
df = pd.concat([df, err], axis=1)
Output:
>>> df
foo bar baz a_x b_x c_x a_y b_y c_y a_err b_err c_err
0 0.636962 0.269787 0.040974 0.016528 0.813270 0.912756 0.606636 0.729497 0.543625 -0.972755 0.114838 0.679017
1 0.935072 0.815854 0.002739 0.857404 0.033586 0.729655 0.175656 0.863179 0.541461 3.881166 -0.961091 0.347567
2 0.299712 0.422687 0.028320 0.124283 0.670624 0.647190 0.615385 0.383678 0.997210 -0.798040 0.747885 -0.351000
3 0.980835 0.685542 0.650459 0.688447 0.388921 0.135097 0.721488 0.525354 0.310242 -0.045796 -0.259697 -0.564545
4 0.485835 0.889488 0.934044 0.357795 0.571530 0.321869 0.594300 0.337911 0.391619 -0.397955 0.691361 -0.178106
5 0.890274 0.227158 0.623187 0.084015 0.832644 0.787098 0.239369 0.876484 0.058568 -0.649014 -0.050018 12.439042
6 0.336117 0.150279 0.450339 0.796324 0.230642 0.052021 0.404552 0.198513 0.090753 0.968411 0.161849 -0.426782
7 0.580332 0.298696 0.671995 0.199515 0.942113 0.365110 0.105495 0.629108 0.927155 0.891226 0.497538 -0.606204
8 0.440377 0.954590 0.499896 0.425229 0.620213 0.995097 0.948944 0.460045 0.757729 -0.551893 0.348158 0.313262
9 0.497423 0.529312 0.785786 0.414656 0.734484 0.711143 0.932060 0.114933 0.729015 -0.555119 5.390557 -0.024516
10 0.927424 0.967926 0.014706 0.863640 0.981195 0.957210 0.148764 0.972629 0.889936 4.805437 0.008807 0.075595
11 0.822374 0.479988 0.232373 0.801881 0.923530 0.266130 0.538934 0.442753 0.931017 0.487900 1.085882 -0.714151
12 0.040511 0.732006 0.614373 0.028365 0.719220 0.015992 0.757951 0.512759 0.929104 -0.962576 0.402648 -0.982788
13 0.066082 0.841317 0.066690 0.344310 0.430299 0.966062 0.562232 0.258865 0.241676 -0.387601 0.662254 2.997349
14 0.888118 0.225869 0.124555 0.288331 0.586123 0.554091 0.809711 0.560476 0.288421 -0.643909 0.045760 0.921116
15 0.412896 0.818121 0.626506 0.959078 0.369404 0.552612 0.593924 0.848291 0.145474 0.614815 -0.564531 2.798708
16 0.406510 0.909959 0.043067 0.822706 0.415384 0.829804 0.009955 0.365046 0.078630 81.646166 0.137895 9.553270
17 0.652615 0.273849 0.702652 0.943801 0.126817 0.864778 0.059464 0.380771 0.429774 14.871772 -0.666946 1.012170
18 0.488850 0.976462 0.775691 0.308857 0.269837 0.863120 0.881307 0.510707 0.344296 -0.649546 -0.471640 1.506915
19 0.994917 0.315944 0.182712 0.880098 0.812335 0.667889 0.958414 0.925715 0.748249 -0.081714 -0.122477 -0.107396
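Note that column-wise grouping with `groupby(..., axis=1)` is deprecated in recent pandas (2.1+). One way to keep the same groupby idea without it is to transpose, group the rows by prefix, and transpose back (a sketch, not part of the original answer):

```python
import numpy as np
import pandas as pd

colnames = ['foo', 'bar', 'baz', 'a_x', 'b_x', 'c_x', 'a_y', 'b_y', 'c_y']
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((20, len(colnames))), columns=colnames)

# Transpose so the paired columns become rows, group rows by prefix,
# then apply the formula: row 0 of each group is the _x series, row 1 the _y series
err = (df.filter(regex=r'_[xy]$')
         .T
         .groupby(lambda name: name.split('_')[0])
         .apply(lambda g: (g.iloc[0] - g.iloc[1]) / g.iloc[1])
         .T
         .add_suffix('_err'))
df = pd.concat([df, err], axis=1)
```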
You can also use filter to split your columns:
# Python >= 3.8 (walrus operator)
# .values drops the _x labels, so the subtraction aligns by position
err = (df.filter(regex='_x$').values - (y := df.filter(regex='_y$'))) / y
err.columns = err.columns.str.split('_').str[0] + '_err'
df = pd.concat([df, err], axis=1)
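If raw speed is the goal, you can also drop to NumPy entirely and rebuild the labels afterwards, a sketch assuming the _x and _y columns appear in matching order (which `filter` preserves here):

```python
import numpy as np
import pandas as pd

colnames = ['foo', 'bar', 'baz', 'a_x', 'b_x', 'c_x', 'a_y', 'b_y', 'c_y']
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((20, len(colnames))), columns=colnames)

x = df.filter(regex='_x$')
y = df.filter(regex='_y$')
# One vectorized NumPy operation over all m pairs at once
err = pd.DataFrame((x.to_numpy() - y.to_numpy()) / y.to_numpy(),
                   index=df.index,
                   columns=x.columns.str.replace('_x', '_err'))
df = pd.concat([df, err], axis=1)
```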
Since the _x and _y columns are paired, you can limit any iteration to just the columns and still be performant. The solution below uses a MultiIndex and passes the bulk of the processing to Pandas:
# Keep only the paired columns (the ones containing '_')
temp = df.filter(like='_')
# Split the names into a two-level MultiIndex: a_x -> (a, x)
temp.columns = temp.columns.str.split('_', expand=True)
# Put the x/y level on top so temp.x and temp.y select the pairs
temp = temp.swaplevel(axis='columns')
temp = temp.x.sub(temp.y).div(temp.y).add_suffix('_err')
# assign returns a new dataframe with the error columns appended
df.assign(**temp)
foo bar baz a_x b_x c_x a_y b_y c_y a_err b_err c_err
0 0.636962 0.269787 0.040974 0.016528 0.813270 0.912756 0.606636 0.729497 0.543625 -0.972755 0.114838 0.679017
1 0.935072 0.815854 0.002739 0.857404 0.033586 0.729655 0.175656 0.863179 0.541461 3.881166 -0.961091 0.347567
2 0.299712 0.422687 0.028320 0.124283 0.670624 0.647190 0.615385 0.383678 0.997210 -0.798040 0.747885 -0.351000
3 0.980835 0.685542 0.650459 0.688447 0.388921 0.135097 0.721488 0.525354 0.310242 -0.045796 -0.259697 -0.564545
4 0.485835 0.889488 0.934044 0.357795 0.571530 0.321869 0.594300 0.337911 0.391619 -0.397955 0.691361 -0.178106
5 0.890274 0.227158 0.623187 0.084015 0.832644 0.787098 0.239369 0.876484 0.058568 -0.649014 -0.050018 12.439042
6 0.336117 0.150279 0.450339 0.796324 0.230642 0.052021 0.404552 0.198513 0.090753 0.968411 0.161849 -0.426782
7 0.580332 0.298696 0.671995 0.199515 0.942113 0.365110 0.105495 0.629108 0.927155 0.891226 0.497538 -0.606204
8 0.440377 0.954590 0.499896 0.425229 0.620213 0.995097 0.948944 0.460045 0.757729 -0.551893 0.348158 0.313262
9 0.497423 0.529312 0.785786 0.414656 0.734484 0.711143 0.932060 0.114933 0.729015 -0.555119 5.390557 -0.024516
10 0.927424 0.967926 0.014706 0.863640 0.981195 0.957210 0.148764 0.972629 0.889936 4.805437 0.008807 0.075595
11 0.822374 0.479988 0.232373 0.801881 0.923530 0.266130 0.538934 0.442753 0.931017 0.487900 1.085882 -0.714151
12 0.040511 0.732006 0.614373 0.028365 0.719220 0.015992 0.757951 0.512759 0.929104 -0.962576 0.402648 -0.982788
13 0.066082 0.841317 0.066690 0.344310 0.430299 0.966062 0.562232 0.258865 0.241676 -0.387601 0.662254 2.997349
14 0.888118 0.225869 0.124555 0.288331 0.586123 0.554091 0.809711 0.560476 0.288421 -0.643909 0.045760 0.921116
15 0.412896 0.818121 0.626506 0.959078 0.369404 0.552612 0.593924 0.848291 0.145474 0.614815 -0.564531 2.798708
16 0.406510 0.909959 0.043067 0.822706 0.415384 0.829804 0.009955 0.365046 0.078630 81.646166 0.137895 9.553270
17 0.652615 0.273849 0.702652 0.943801 0.126817 0.864778 0.059464 0.380771 0.429774 14.871772 -0.666946 1.012170
18 0.488850 0.976462 0.775691 0.308857 0.269837 0.863120 0.881307 0.510707 0.344296 -0.649546 -0.471640 1.506915
19 0.994917 0.315944 0.182712 0.880098 0.812335 0.667889 0.958414 0.925715 0.748249 -0.081714 -0.122477 -0.107396
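The question is ultimately about speed, so it is worth measuring on data closer to the real size. A quick timing sketch comparing the per-pair loop with a single vectorized NumPy operation (the sizes and column names are arbitrary assumptions, not from the question):

```python
import time
import numpy as np
import pandas as pd

# Arbitrary sizes; tune to match the real dataframe
rng = np.random.default_rng(0)
m, n = 100, 20_000
x_cols = [f'c{i}_x' for i in range(m)]
y_cols = [f'c{i}_y' for i in range(m)]
df = pd.DataFrame(rng.random((n, 2 * m)), columns=x_cols + y_cols)

# Baseline: one pandas operation per column pair
t0 = time.perf_counter()
loop = pd.DataFrame({f'c{i}_err': (df[f'c{i}_x'] - df[f'c{i}_y']) / df[f'c{i}_y']
                     for i in range(m)})
t_loop = time.perf_counter() - t0

# Vectorized: one NumPy operation over all pairs at once
t0 = time.perf_counter()
vec = (df[x_cols].to_numpy() - df[y_cols].to_numpy()) / df[y_cols].to_numpy()
t_vec = time.perf_counter() - t0

print(f'loop: {t_loop:.3f}s, vectorized: {t_vec:.3f}s')
```

The two approaches produce the same numbers; how much the vectorized version wins by depends on m, n, and your pandas/NumPy versions, so measure on your own data.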