
Find entries that do not match between columns and iterate through columns

I have two datasets that I need to validate against each other. All records should match. I am having trouble working out how to iterate through each pair of columns.

import pandas as pd 
import numpy as np

df = pd.DataFrame([['charlie', 'charlie', 'beta', 'cappa'], ['charlie', 'charlie', 'beta', 'delta'], ['charlie', 'charlie', 'beta', 'beta']], columns=['A_1', 'A_2','B_1','B_2'])

df.head()

Out[83]: 
       A_1      A_2   B_1    B_2
0  charlie  charlie  beta  cappa
1  charlie  charlie  beta  delta
2  charlie  charlie  beta   beta

For example, in the data above, I want to compare A_1 to A_2 and B_1 to B_2, and add two new columns, A_check and B_check respectively. A_check should be True where A_1 matches A_2, and likewise B_check for B_1 and B_2.

Something like this:

df['B_check'] = np.where((df['B_1'] == df['B_2']), 'True', 'False')
df_subset = df[df['B_check']=='False'] 

But it should work across any given column names, where the columns to be checked against each other always share the same name before the underscore and always have 1 or 2 after the underscore.

Ultimately, the actual task has multiple data frames with varying columns to check, as well as varying numbers of columns to check. The output I am ultimately going for is a data frame that shows all the records that were false for any particular column check.

Devin asked Feb 28 '20 20:02


1 Answer

Using a slightly more general regex:

from itertools import groupby
import re

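# Key each column by its prefix if the name ends in _1 or _2, otherwise by None.
for k, cols in groupby(sorted(df.columns), lambda x: x[:-2] if re.match(r".+_(1|2)$", x) else None):
    cols = list(cols)
    # Compare only when both the _1 and _2 columns are present.
    if k and len(cols) == 2:
        df[f"{k}_check"] = df[cols[0]].eq(df[cols[1]])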

This pairs only columns whose names end in _1 and _2, regardless of what comes before the suffix, and computes a _check column only when both the _1 and the _2 column are present (assuming you don't have two columns with the same name).

For the sample data:

       A_1      A_2   B_1    B_2  A_check  B_check
0  charlie  charlie  beta  cappa     True    False
1  charlie  charlie  beta  delta     True    False
2  charlie  charlie  beta   beta     True     True
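
The question also asks for a data frame of the records that failed any check. One possible way to get there is sketched below. The helper name mismatched_rows is made up for this illustration, and instead of groupby it looks up each _2 partner directly, so it does not depend on the sort order of the column names. Note that eq treats NaN as unequal to NaN, so rows with missing values on both sides would be reported as mismatches.

```python
import re
import pandas as pd


def mismatched_rows(df):
    """Hypothetical helper: return rows where any <prefix>_1 / <prefix>_2 pair differs."""
    out = df.copy()
    check_cols = []
    for col in df.columns:
        # Match names like "A_1" and capture the prefix "A".
        m = re.fullmatch(r"(.+)_1", str(col))
        if m and f"{m.group(1)}_2" in df.columns:
            prefix = m.group(1)
            out[f"{prefix}_check"] = df[col].eq(df[f"{prefix}_2"])
            check_cols.append(f"{prefix}_check")
    if not check_cols:
        # No pairs found, so nothing can be reported as a mismatch.
        return out.iloc[0:0]
    # Keep a row if at least one of its checks is False.
    return out[~out[check_cols].all(axis=1)]


df = pd.DataFrame(
    [['charlie', 'charlie', 'beta', 'cappa'],
     ['charlie', 'charlie', 'beta', 'delta'],
     ['charlie', 'charlie', 'beta', 'beta']],
    columns=['A_1', 'A_2', 'B_1', 'B_2'])

bad = mismatched_rows(df)
print(bad)
```

For the sample data this keeps rows 0 and 1, where B_1 and B_2 differ. Because the function copies its input, it can be called on each of several data frames in turn without modifying them.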
Grzegorz Skibinski answered Nov 15 '22 20:11