
Separate DataFrame into parts based on column prefix and perform arithmetic on the parts

I have a table with over 200 columns. The columns come in pairs (e.g., two types of Benz); an example is below. I want to calculate the difference of each pair as a new column (the last column shows an example). I was thinking of splitting the table into two tables (A and B) according to the first letter and sorting the columns. Is there a more efficient way to do this in Pandas? Thanks!

A_Benz  B_Benz  A_Audi  B_Audi  A_Honda  B_Honda  dif_Audi
     1       0       1       1        0        0         0
     1       0       0       1        0        0        -1
     1       0       0       1        0        0        -1
     1       0       1       1        1        1         0
     1       0       0       1        0        0        -1
Alex Xu asked Jan 03 '23 14:01



1 Answer

Assuming this is your starting point:

df
   A_Benz  B_Benz  A_Audi  B_Audi  A_Honda  B_Honda
1       1       0       1       1        0        0
2       1       0       0       1        0        0
3       1       0       0       1        0        0
4       1       0       1       1        1        1
5       1       0       0       1        0        0
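
For anyone who wants to run the options below, here is one way to rebuild that frame. The values are copied from the printout above; the index 1 to 5 is an assumption read off the same printout.

```python
import pandas as pd

# Values copied from the printed frame; index 1..5 is assumed from the printout.
df = pd.DataFrame(
    {
        "A_Benz": [1, 1, 1, 1, 1],
        "B_Benz": [0, 0, 0, 0, 0],
        "A_Audi": [1, 0, 0, 1, 0],
        "B_Audi": [1, 1, 1, 1, 1],
        "A_Honda": [0, 0, 0, 1, 0],
        "B_Honda": [0, 0, 0, 1, 0],
    },
    index=range(1, 6),
)
print(df)
```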

Option 1
This would make a nice use case for filter:

i = df.filter(regex='^A_')
j = df.filter(regex='^B_')

# strip the prefix so both frames share the same column labels
i.columns = i.columns.str.split('_', n=1).str[-1]
j.columns = j.columns.str.split('_', n=1).str[-1]

(i - j).add_prefix('diff_')

   diff_Benz  diff_Audi  diff_Honda
1          1          0           0
2          1         -1           0
3          1         -1           0
4          1          0           0
5          1         -1           0

If you want to add this back to the original DataFrame, use concat:

df = pd.concat([df, (i - j).add_prefix('diff_')], axis=1)
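
A self-contained sketch of Option 1, in case it helps. `pair_differences` is a hypothetical helper name (not part of pandas), and the small frame, including the unpaired `A_Kia` column, is made-up data. The sketch assumes that a column without a partner should simply be skipped; plain `i - j` would instead produce an all-NaN column for it.

```python
import pandas as pd


def pair_differences(df, left="A_", right="B_", out="diff_"):
    """Hypothetical helper: left-minus-right for every suffix found under both prefixes."""
    a = df.loc[:, df.columns.str.startswith(left)]
    b = df.loc[:, df.columns.str.startswith(right)]
    # Strip the prefixes so the two frames share column labels.
    a = a.rename(columns=lambda c: c[len(left):])
    b = b.rename(columns=lambda c: c[len(right):])
    # Keep only suffixes present in both groups, in the order they appear under `left`.
    common = [c for c in a.columns if c in b.columns]
    return (a[common] - b[common]).add_prefix(out)


# Made-up sample data; A_Kia deliberately has no B_ partner.
df = pd.DataFrame(
    {
        "A_Benz": [1, 1],
        "B_Benz": [0, 0],
        "A_Audi": [1, 0],
        "B_Audi": [1, 1],
        "A_Kia": [1, 1],
    },
    index=[1, 2],
)
result = pd.concat([df, pair_differences(df)], axis=1)
print(result)
```

Subtracting two DataFrames aligns on column labels, which is why the prefixes have to be removed first; the column order of the inputs does not matter.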

Option 2
An alternative using diff. Note that this does a lot of unnecessary subtraction, since it differences every pair of adjacent columns and then discards half of the results:

import re

# if needed, order the columns so each A_ column is immediately followed by its B_ partner
df = df[sorted(df.columns, key=lambda x: x.split('_', 1)[::-1])]
# subtract the next column from each column, then keep every other result (A - B)
df.diff(-1, axis=1).iloc[:, ::2].rename(columns=lambda x: re.sub('^A_', 'diff_', x))

   diff_Audi  diff_Benz  diff_Honda
1        0.0        1.0         0.0
2       -1.0        1.0         0.0
3       -1.0        1.0         0.0
4        0.0        1.0         0.0
5       -1.0        1.0         0.0

A more performant version of this (similar to @jpp's method) subtracts the underlying NumPy arrays directly:

c = sorted(df.columns, key=lambda x: x.split('_', 1)[1])
df = df[c]

pd.DataFrame(
    df.iloc[:, ::2].values - df.iloc[:, 1::2].values, columns=c[::2]
)

   A_Audi  A_Benz  A_Honda
0       0       1        0
1      -1       1        0
2      -1       1        0
3       0       1        0
4      -1       1        0
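
The result above has a fresh 0-based index and keeps the A_ labels, because only raw values and `c[::2]` are passed to the constructor. A possible variant that keeps the original index and uses diff_ labels is sketched below. The two-row frame is made-up data, and the alignment check is an added safeguard rather than part of the method above.

```python
import pandas as pd

# Made-up sample data with a non-default index.
df = pd.DataFrame(
    {"A_Benz": [1, 1], "B_Benz": [0, 0], "A_Audi": [1, 0], "B_Audi": [1, 1]},
    index=[1, 2],
)

# Sort by (suffix, prefix) so each A_ column is immediately followed by its B_ partner.
c = sorted(df.columns, key=lambda x: x.split("_", 1)[::-1])
a_cols, b_cols = c[::2], c[1::2]

# Positional subtraction ignores labels, so verify that the suffixes line up first.
suffixes = [x.split("_", 1)[1] for x in a_cols]
if suffixes != [x.split("_", 1)[1] for x in b_cols]:
    raise ValueError("columns do not form complete A_/B_ pairs")

diff = pd.DataFrame(
    df[a_cols].to_numpy() - df[b_cols].to_numpy(),
    index=df.index,
    columns=["diff_" + s for s in suffixes],
)
print(diff)
```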
cs95 answered Jan 11 '23 14:01
