Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas getting all rows listed in one dataframe but not the other UNORDERD

I cannot find an easy way to get all the rows of a data frame that are found in one dataframe but not a second dataframe if the data is unordered.

These two answers talk are solutions for ordered data:

Get rows that are present in one dataframe, but not the other

pandas get rows which are NOT in other dataframe

So just to make it clear I'm trying to get this: data from one dataframe thats not found in the other dataframe

In one of those related question mentioned above I found a multiindexing solution that supposedly works with unordered data, but I was unable to implement it. I am hoping theres an easier way.

let me give you an example of the data I'm working with:

DF1
col_a   col_b
1325    foo
1397    foo        #<---matching value, but not matching index in DF2
1645    foo
...     ...

DF2
col_1   col_2
1397    foo        #<---matching value, but not matching index in DF1
1500    foo
1621    foo
...     ...

Now if that were all the data in both dataframes the result for processing this specifically for DF1 would look like this:

DF1_UNIQUE
col_a   col_b
1325    foo
1645    foo

(So I'm really only caring about col_a or for DF2 col_1). Notice its missing the 1397 row. that's because it is found in DF2, so I don't want it returned to my new DF. But its not found in the same index and there in lies the problem I have. I already easily created a solution if all the matching indexes are lined up, but I don't know where to start on the indexes that aren't lined up. Can I use the merge function? Or is that the wrong tool for this job?

This code isn't entirely relevant but its the solution I came up with if all the indexes lined up correctly:

def getUniqueEntries(df1, df2):
    """takes two dataframes, returns a dataframe that is comprized of all the rows unique to the first dataframe."""
    d1columns = df1.columns
    d2columns = df2.columns
    df3 = pd.merge(df1, df2, left_on=d1columns[0], right_on=d2columns[0])
    print(df3)
    return df1[(~df1[d1columns[0]].isin(df3[d1columns[0]]))]     

def main(fileread1, fileread2, writeprefix):
    df1 = pd.read_csv(fileread1)
    df2 = pd.read_csv(fileread2)

    df3 = getUniqueEntries(df1, df2)
    df4 = getUniqueEntries(df2, df1)
    
    print(df3)
    print(df4)

    df3.to_csv(writeprefix+fileread1, index=False)
    df4.to_csv(writeprefix+fileread2, index=False)
    
if __name__ == '__main__':
    main(sys.argv[1], sys.argv[2], sys.argv[3])
like image 875
Legit Stack Avatar asked Dec 18 '22 14:12

Legit Stack


2 Answers

Yes, you can use merge with the indicator parameter:

I renamed the columns to avoid duplicated columns You can also pass left_on and right_on

merged = DF1.merge(DF2.rename(columns={'col_1': 'col_a', 'col_2': 'col_b'}), how='left', indicator=True)
merged
Out: 
   col_a col_b     _merge
0   1325   foo  left_only
1   1397   foo       both
2   1645   foo  left_only

Now, you can filter merged using the indicator column:

merged[merged['_merge']=='left_only']
Out: 
   col_a col_b     _merge
0   1325   foo  left_only
2   1645   foo  left_only
like image 186
ayhan Avatar answered Dec 21 '22 10:12

ayhan


This uses boolean indexing to locate all of the rows in df1 where the values in col_a are NOT (~) in col_a of df2. It uses isin() to locate matching rows, and the negation operator (~) to find the opposite of those (i.e. the ones that don't match).

df1[~df1.col_a.isin(df2.col_a)]

You mentioned an index, but your sample data does not have one. The matching is thus done only on the values in col_a per your example.

like image 22
Alexander Avatar answered Dec 21 '22 09:12

Alexander