I know i can do like below if we are checking only two columns together.
df['flag'] = df['a_id'].isin(df['b_id'])
where df
is a data frame, and a_id
and b_id
are two columns of the data frame. It will return True
or False
value based on the match. But i need to compare multiple columns together.
For example: if there are a_id , a_region, a_ip, b_id, b_region and b_ip
columns. I want to compare like below,
a_key = df['a_id'] + df['a_region] + df['a_ip']
b_key = df['b_id'] + df['b_region] + df['b_ip']
df['flag'] = a_key.isin(b_key)
Somehow the above code is always returning False
value. The output should be like below,
First row flag will be True because there is a match.
a_key
becomes 2a10
this is match with last row of b_key
(2a10)
Note that the columns of dataframes are data series. So if you take two columns as pandas series, you may compare them just like you would do with numpy arrays. "I'd like to check if a person in one data frame is in another one." The condition is for both name and first name be present in both dataframes and in the same row.
Comparing column names of two dataframes. Incase you are trying to compare the column names of two dataframes: If df1 and df2 are the two dataframes: set (df1.columns).intersection (set (df2.columns)) This will provide the unique column names which are contained in both the dataframes. Example:
There are multiple ways to compare column values in 2 different excel files. The approach here checks each sequence in the Unknown seq column from the first file with each sequence in the Reference_sequences column from the second file.
By using the Where () method in NumPy, we are given the condition to compare the columns. If ‘column1’ is lesser than ‘column2’ and ‘column1’ is lesser than the ‘column3’, We print the values of ‘column1’.
You were going in the right direction, just use:
a_key = df['a_id'].astype(str) + df['a_region'] + df['a_ip'].astype(str)
b_key = df['b_id'].astype(str) + df['b_region'] + df['b_ip'].astype(str)
a_key.isin(b_key)
Mine is giving below results:
0 True
1 False
2 False
You can use isin
with DataFrame
as value, but as per the docs:
If values is a DataFrame, then both the index and column labels must match
So this should work:
# Removing the prefixes from column names
df_a = df[['a_id', 'a_region', 'a_ip']].rename(columns=lambda x: x[2:])
df_b = df[['b_id', 'b_region', 'b_ip']].rename(columns=lambda x: x[2:])
# Find rows where all values are in the other
matched = df_a.isin(df_b).all(axis=1)
# Get actual rows with boolean indexing
df_a.loc[matched]
# ... or add boolean flag to dataframe
df['flag'] = matched
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With