I have a table with over 200 columns. The columns come in pairs (e.g., two types of Benz); an example is below. I want to calculate the difference within each pair and store it in a new column (the last column shows an example). I was thinking of splitting the table into two tables (A and B) according to the first letter and sorting the columns, but is there a more efficient way in pandas? Thanks!
A_Benz B_Benz A_Audi B_Audi A_Honda B_Honda dif_Audi
1 0 1 1 0 0 0
1 0 0 1 0 0 -1
1 0 0 1 0 0 -1
1 0 1 1 1 1 0
1 0 0 1 0 0 -1
Assuming this is your starting point -
df
A_Benz B_Benz A_Audi B_Audi A_Honda B_Honda
1 1 0 1 1 0 0
2 1 0 0 1 0 0
3 1 0 0 1 0 0
4 1 0 1 1 1 1
5 1 0 0 1 0 0
Option 1
This would make a nice use case for filter:
i = df.filter(regex='^A_')
j = df.filter(regex='^B_')
# strip the 'A_' / 'B_' prefix so both frames share the same column names
i.columns = i.columns.str.split('_', n=1).str[-1]
j.columns = j.columns.str.split('_', n=1).str[-1]
(i - j).add_prefix('diff_')
diff_Benz diff_Audi diff_Honda
1 1 0 0
2 1 -1 0
3 1 -1 0
4 1 0 0
5 1 -1 0
If you want to add this back to the original DataFrame, use concat:
df = pd.concat([df, (i - j).add_prefix('diff_')], axis=1)
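For anyone who wants to run Option 1 end to end, here is a minimal self-contained sketch. It rebuilds the example frame from the question by hand and uses rename with a lambda instead of assigning to .columns; the names a, b, diff and result are just illustrative choices, not anything required by pandas.

```python
import pandas as pd

# Example frame from the question, rebuilt by hand
df = pd.DataFrame(
    {
        "A_Benz": [1, 1, 1, 1, 1],
        "B_Benz": [0, 0, 0, 0, 0],
        "A_Audi": [1, 0, 0, 1, 0],
        "B_Audi": [1, 1, 1, 1, 1],
        "A_Honda": [0, 0, 0, 1, 0],
        "B_Honda": [0, 0, 0, 1, 0],
    },
    index=[1, 2, 3, 4, 5],
)

# Select each half of the pairs and drop the 'A_' / 'B_' prefix,
# so the subtraction aligns columns by brand name
a = df.filter(regex=r"^A_").rename(columns=lambda c: c.split("_", 1)[1])
b = df.filter(regex=r"^B_").rename(columns=lambda c: c.split("_", 1)[1])

diff = (a - b).add_prefix("diff_")

# Attach the new columns to the original frame
result = pd.concat([df, diff], axis=1)
print(result)
```

Because the subtraction aligns on column labels, this should also work when the A_ and B_ columns are not stored next to each other.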
Option 2
An alternative using diff; this does a lot of unnecessary subtraction, because it subtracts every adjacent pair of columns and then throws half of the results away:
import re
# if needed, order the columns correctly
df = df[sorted(df.columns, key=lambda x: x.split('_', 1)[1])]
# compute consecutive column differences
df.diff(-1, axis=1).iloc[:, ::2].rename(columns=lambda x: re.sub('^A_', 'diff_', x))
diff_Benz diff_Audi diff_Honda
1 1.0 0.0 0.0
2 1.0 -1.0 0.0
3 1.0 -1.0 0.0
4 1.0 0.0 0.0
5 1.0 -1.0 0.0
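To see where the unnecessary work comes from, here is a small sketch that looks at the intermediate result of diff before the slicing step. It assumes the columns are already ordered as A/B pairs, as in the example frame; the names full and kept are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A_Benz": [1, 1, 1, 1, 1],
        "B_Benz": [0, 0, 0, 0, 0],
        "A_Audi": [1, 0, 0, 1, 0],
        "B_Audi": [1, 1, 1, 1, 1],
        "A_Honda": [0, 0, 0, 1, 0],
        "B_Honda": [0, 0, 0, 1, 0],
    },
    index=[1, 2, 3, 4, 5],
)

# diff(-1, axis=1) computes "this column minus the next column" for EVERY
# column, including cross-pair ones such as B_Benz - A_Audi
full = df.diff(-1, axis=1)

# Only every second column (the A_ ones) holds an A - B difference;
# the B_ columns are discarded, and the last column is all NaN
kept = full.iloc[:, ::2]

print(full)
print(kept)
```

Half of the columns in full are computed only to be dropped, which is the overhead the NumPy version avoids.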
A more performant version of this (similar to @jpp's method) subtracts the underlying NumPy arrays directly:
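# order the columns by brand so each A_/B_ pair is adjacent
# (assumes A_ already comes before B_ within each pair; sorted() is stable)
c = sorted(df.columns, key=lambda x: x.split('_', 1)[1])
df = df[c]
# subtract the B_ columns from the A_ columns as plain NumPy arrays
pd.DataFrame(
    df.iloc[:, ::2].values - df.iloc[:, 1::2].values, columns=c[::2]
)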
A_Audi A_Benz A_Honda
0 0 1 0
1 -1 1 0
2 -1 1 0
3 0 1 0
4 -1 1 0
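Note that the result above keeps the A_ column names and loses the original index, because .values drops the labels. If you want diff_ names and the original index back, something like the following sketch might do. It assumes every brand has exactly one A_ and one B_ column, and the names brands, a_vals, b_vals and diff are my own illustrative choices.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A_Benz": [1, 1, 1, 1, 1],
        "B_Benz": [0, 0, 0, 0, 0],
        "A_Audi": [1, 0, 0, 1, 0],
        "B_Audi": [1, 1, 1, 1, 1],
        "A_Honda": [0, 0, 0, 1, 0],
        "B_Honda": [0, 0, 0, 1, 0],
    },
    index=[1, 2, 3, 4, 5],
)

# Unique brand names, taken from the part after the first underscore
brands = sorted({c.split("_", 1)[1] for c in df.columns})

# Pull the A_ and B_ columns out in the same brand order as NumPy arrays
a_vals = df[[f"A_{b}" for b in brands]].to_numpy()
b_vals = df[[f"B_{b}" for b in brands]].to_numpy()

# Subtract the raw arrays, then restore the index and label the columns
diff = pd.DataFrame(
    a_vals - b_vals,
    index=df.index,
    columns=[f"diff_{b}" for b in brands],
)
print(diff)
```

Selecting the columns by explicit name means the result does not depend on the original column order.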