I cannot find an easy way to get all the rows of a data frame that are found in one dataframe but not a second dataframe if the data is unordered.
These two answers talk are solutions for ordered data:
Get rows that are present in one dataframe, but not the other
pandas get rows which are NOT in other dataframe
So just to make it clear I'm trying to get this:
In one of those related question mentioned above I found a multiindexing solution that supposedly works with unordered data, but I was unable to implement it. I am hoping theres an easier way.
let me give you an example of the data I'm working with:
DF1
col_a col_b
1325 foo
1397 foo #<---matching value, but not matching index in DF2
1645 foo
... ...
DF2
col_1 col_2
1397 foo #<---matching value, but not matching index in DF1
1500 foo
1621 foo
... ...
Now if that were all the data in both dataframes the result for processing this specifically for DF1 would look like this:
DF1_UNIQUE
col_a col_b
1325 foo
1645 foo
(So I'm really only caring about col_a
or for DF2 col_1
). Notice its missing the 1397 row. that's because it is found in DF2, so I don't want it returned to my new DF. But its not found in the same index and there in lies the problem I have. I already easily created a solution if all the matching indexes are lined up, but I don't know where to start on the indexes that aren't lined up. Can I use the merge function? Or is that the wrong tool for this job?
This code isn't entirely relevant but its the solution I came up with if all the indexes lined up correctly:
def getUniqueEntries(df1, df2):
"""takes two dataframes, returns a dataframe that is comprized of all the rows unique to the first dataframe."""
d1columns = df1.columns
d2columns = df2.columns
df3 = pd.merge(df1, df2, left_on=d1columns[0], right_on=d2columns[0])
print(df3)
return df1[(~df1[d1columns[0]].isin(df3[d1columns[0]]))]
def main(fileread1, fileread2, writeprefix):
df1 = pd.read_csv(fileread1)
df2 = pd.read_csv(fileread2)
df3 = getUniqueEntries(df1, df2)
df4 = getUniqueEntries(df2, df1)
print(df3)
print(df4)
df3.to_csv(writeprefix+fileread1, index=False)
df4.to_csv(writeprefix+fileread2, index=False)
if __name__ == '__main__':
main(sys.argv[1], sys.argv[2], sys.argv[3])
Yes, you can use merge with the indicator
parameter:
I renamed the columns to avoid duplicated columns You can also pass left_on
and right_on
merged = DF1.merge(DF2.rename(columns={'col_1': 'col_a', 'col_2': 'col_b'}), how='left', indicator=True)
merged
Out:
col_a col_b _merge
0 1325 foo left_only
1 1397 foo both
2 1645 foo left_only
Now, you can filter merged
using the indicator column:
merged[merged['_merge']=='left_only']
Out:
col_a col_b _merge
0 1325 foo left_only
2 1645 foo left_only
This uses boolean indexing to locate all of the rows in df1
where the values in col_a
are NOT (~
) in col_a
of df2
. It uses isin()
to locate matching rows, and the negation operator (~
) to find the opposite of those (i.e. the ones that don't match).
df1[~df1.col_a.isin(df2.col_a)]
You mentioned an index, but your sample data does not have one. The matching is thus done only on the values in col_a
per your example.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With