Filter dataframe based on multiple columns of another dataframe

Tags: python, pandas

I have a dataframe like this:

    ID1    ID2
0   foo    bar
1   fizz   buzz

And another like this:

    ID1    ID2    Count    Code   
0   abc    def      1        A
1   fizz   buzz     5        A
2   fizz1  buzz2    3        C
3   foo    bar      6        Z
4   foo    bar      6        Z

I would like to filter the second dataframe down to the rows whose ID1 and ID2 match a row in the first dataframe. Whenever there is a match, I want to remove that row from the first dataframe so that it cannot match again, which avoids duplicates. This would yield a dataframe that looks like this:

    ID1    ID2    Count    Code   
1   fizz   buzz     5        A
3   foo    bar      6        Z

I know I can do this by nesting for loops, stepping through all the rows, and manually removing a row from the first frame whenever I get a match, but I am wondering if there is a more pythonic way. I am not experienced in pandas, so there may be a much cleaner approach that I do not know about. I was previously using .isin() but had to scrap it. Each ID pair can exist in the dataframe up to N times, and I need the filtered frame to contain between 0 and N instances of a pair.

Son of a Sailor asked Aug 01 '17 14:08


2 Answers

Try this:

df2.merge(df1[['ID1','ID2']])
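
For reference, here is a self-contained sketch of this one-liner, rebuilt from the sample data in the question. The frame names df1 and df2 follow the answers; the constructor calls are an assumption about how the frames were created.

```python
import pandas as pd

# Sample frames reconstructed from the question (construction is an assumption).
df1 = pd.DataFrame({"ID1": ["foo", "fizz"], "ID2": ["bar", "buzz"]})
df2 = pd.DataFrame({
    "ID1": ["abc", "fizz", "fizz1", "foo", "foo"],
    "ID2": ["def", "buzz", "buzz2", "bar", "bar"],
    "Count": [1, 5, 3, 6, 6],
    "Code": ["A", "A", "C", "Z", "Z"],
})

# Inner join (the default) on the shared columns ID1 and ID2:
# only rows of df2 whose (ID1, ID2) pair appears in df1 are kept.
result = df2.merge(df1[["ID1", "ID2"]])
print(result)
```

Because df2 contains the foo/bar row twice, the inner merge returns both copies, so the result has three rows rather than the two shown in the question. The merge also resets the index instead of keeping the original labels 1 and 3.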
MaxU - stop WAR against UA answered Sep 20 '22 10:09


Use merge with drop_duplicates if the only columns the two dataframes share are the join columns:

df = pd.merge(df1,df2.drop_duplicates())
print (df)
    ID1   ID2  Count Code
0   foo   bar      6    Z
1  fizz  buzz      5    A

If you need to check for duplicates only in the ID columns:

df = pd.merge(df1,df2.drop_duplicates(subset=['ID1','ID2']))
print (df)
    ID1   ID2  Count Code
0   foo   bar      6    Z
1  fizz  buzz      5    A

If more columns overlap, add the on parameter:

df = pd.merge(df1, df2.drop_duplicates(), on=['ID1','ID2'])
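
To illustrate why on matters, here is a small sketch in which df1 carries an extra Code column. That column and its values are hypothetical, added only for this example; the question's df1 has just ID1 and ID2.

```python
import pandas as pd

# df1 with a hypothetical extra "Code" column (not in the question).
df1 = pd.DataFrame({
    "ID1": ["foo", "fizz"],
    "ID2": ["bar", "buzz"],
    "Code": ["Q", "A"],
})
df2 = pd.DataFrame({
    "ID1": ["abc", "fizz", "fizz1", "foo", "foo"],
    "ID2": ["def", "buzz", "buzz2", "bar", "bar"],
    "Count": [1, 5, 3, 6, 6],
    "Code": ["A", "A", "C", "Z", "Z"],
})

# Without `on`, merge joins on every shared column, here ID1, ID2 and Code.
without_on = pd.merge(df1, df2.drop_duplicates())

# With `on`, only ID1 and ID2 are join keys; the clashing Code columns
# are kept side by side with the default suffixes _x and _y.
with_on = pd.merge(df1, df2.drop_duplicates(), on=["ID1", "ID2"])

print(without_on)
print(with_on)
```

In this sketch the merge without on keeps only the fizz/buzz row, because foo/bar has different Code values in the two frames, while the merge with on keeps both pairs.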

If you do not need to remove duplicate rows:

df = pd.merge(df1,df2)
print (df)
    ID1   ID2  Count Code
0   foo   bar      6    Z
1   foo   bar      6    Z
2  fizz  buzz      5    A
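
The question also asks for each row of the first dataframe to be used at most once, so that a pair appearing k times in df1 matches at most k rows of df2. One possible sketch of that behaviour, which is not part of the answer above, numbers the repeats of each pair with groupby().cumcount() and merges on that number as well. The helper column name occurrence is hypothetical, and the frames are rebuilt from the question's sample data.

```python
import pandas as pd

df1 = pd.DataFrame({"ID1": ["foo", "fizz"], "ID2": ["bar", "buzz"]})
df2 = pd.DataFrame({
    "ID1": ["abc", "fizz", "fizz1", "foo", "foo"],
    "ID2": ["def", "buzz", "buzz2", "bar", "bar"],
    "Count": [1, 5, 3, 6, 6],
    "Code": ["A", "A", "C", "Z", "Z"],
})

keys = ["ID1", "ID2"]

# Number the repeats of each pair within each frame:
# 0 for the first occurrence, 1 for the second, and so on.
# "occurrence" is a hypothetical helper column name.
left = df1.assign(occurrence=df1.groupby(keys).cumcount())
right = df2.assign(occurrence=df2.groupby(keys).cumcount())

# Joining on the pair plus its occurrence number pairs the i-th copy in df2
# with the i-th copy in df1, so each df1 row is matched at most once.
matched = right.merge(left, on=keys + ["occurrence"]).drop(columns="occurrence")
print(matched)
```

With the sample data this keeps one fizz/buzz row and one foo/bar row. If df1 listed foo/bar twice, both foo/bar rows of df2 would be kept.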
jezrael answered Sep 18 '22 10:09