Let's say I have a pd.DataFrame that looks as such:
id col1_a col1_b col2_a col2_b
1 x x 2 3
2 z d 4 5
3 y y 9 9
4 p p 8 1
What this dataframe represents is a 2 dataframe (df_a, df_b) column by column comparison.
I am trying to get a dataframe that highlights and finds the columns that contain those differences as such:
id col1_a col1_b col2_a col2_b diff
1 x x 2 3 col2
2 z d 4 5 col1,col2
3 y y 9 9 None
4 p p 8 1 col2
How can I achieve something like this without having to doubly traverse through the cols and rows.
I know I can achieve this by doing something similar (not tested):
for col_ptr1 in df.columns:
for col_ptr2 in df.columns:
for idx, row in df.iterrows():
if col_ptr1.strip('_a') == col_ptr2.strip('_b'):
blah blah blah...
This is super ugly. I wonder if there is a more pandas style approach to this.
Select the subset of columns containing col, then split these column names around delimiter _ and extract the first component of split using the str accessor
Now, group the dataframe on the col prefix extracted in the previous step, and agg using nunique along axis=1 to count the unqiue values. Check for the unique values if not equal to one then add the corresponding column names in diff columns using dataframe.dot
c = df.filter(regex=r'_(a|b)$')
m = c.groupby(c.columns.str.split('_').str[0], axis=1).nunique().ne(1)
df['diff'] = m.dot(m.columns + ',').str[:-1]
id col1_a col1_b col2_a col2_b diff
0 1 x x 2 3 col2
1 2 z d 4 5 col1,col2
2 3 y y 9 9
3 4 p p 8 1 col2
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With