
Check if pandas dataframe is subset of other dataframe

I have two Python pandas DataFrames, A and B, with the same columns (but different data). I want to check whether A is a subset of B, that is, whether all rows of A are contained in B.

Any idea how to do it?

Paul asked Mar 28 '18 09:03



2 Answers

DataFrame.merge(other) joins on the intersection of the columns by default (all columns with the same names in both DataFrames) and uses how='inner'. If A is a subset of B, the inner join should therefore have the same number of rows as A (provided neither DataFrame has duplicates):

len(A.merge(B)) == len(A)

PS: this will NOT work properly if either DataFrame has duplicated rows; see below for such cases.

Demo:

In [128]: A
Out[128]:
   A  B  C
0  1  2  3
1  4  5  6

In [129]: B
Out[129]:
   A  B  C
0  4  5  6
1  1  2  3
2  9  8  7

In [130]: len(A.merge(B)) == len(A)
Out[130]: True
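
To make the caveat about duplicates concrete, here is a small sketch (the data is made up for illustration) in which every row of A is present in B, yet the plain length comparison fails because one matching row occurs twice in B and the inner join emits it twice:

```python
import pandas as pd

A = pd.DataFrame({"A": [1, 4], "B": [2, 5], "C": [3, 6]})
# B contains every row of A, but the row (4, 5, 6) appears twice.
B = pd.DataFrame({"A": [4, 1, 9, 4], "B": [5, 2, 8, 5], "C": [6, 3, 7, 6]})

merged = A.merge(B)  # inner join on all shared columns
# (1, 2, 3) matches once and (4, 5, 6) matches twice, so merged has 3 rows.
naive_result = len(merged) == len(A)

print(len(merged), len(A), naive_result)
```

The comparison reports False here even though A is a subset of B, which is why the duplicates need to be removed first.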

For data sets containing duplicates, we can remove the duplicates and use the same method:

In [136]: A
Out[136]:
   A  B  C
0  1  2  3
1  4  5  6
2  1  2  3

In [137]: B
Out[137]:
   A  B  C
0  4  5  6
1  1  2  3
2  9  8  7
3  4  5  6

In [138]: A.merge(B).drop_duplicates()
Out[138]:
   A  B  C
0  1  2  3
2  4  5  6

In [139]: len(A.merge(B).drop_duplicates()) == len(A.drop_duplicates())
Out[139]: True
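
The duplicate-safe check can be wrapped in a small function. This is only a sketch: is_row_subset is a hypothetical name, it assumes both DataFrames have the same column names, and it drops duplicates before the merge rather than after it, which should give the same answer as the expression above.

```python
import pandas as pd


def is_row_subset(a: pd.DataFrame, b: pd.DataFrame) -> bool:
    """Return True if every distinct row of `a` also occurs in `b`.

    Hypothetical helper; assumes `a` and `b` share the same column names.
    """
    a_unique = a.drop_duplicates()
    b_unique = b.drop_duplicates()
    # With duplicates removed on both sides, each row of a_unique can match
    # at most one row of b_unique, so the row counts are comparable.
    return len(a_unique.merge(b_unique)) == len(a_unique)


A = pd.DataFrame({"A": [1, 4, 1], "B": [2, 5, 2], "C": [3, 6, 3]})
B = pd.DataFrame({"A": [4, 1, 9, 4], "B": [5, 2, 8, 5], "C": [6, 3, 7, 6]})

print(is_row_subset(A, B))  # A is contained in B
print(is_row_subset(B, A))  # B has the row (9, 8, 7), which A lacks
```

The check is not symmetric: swapping the arguments asks whether B is contained in A.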
MaxU - stop WAR against UA answered Sep 20 '22 12:09



You can also try:

import pandas as pd

ex = pd.DataFrame({"col1": ["banana", "tomato", "apple"],
                   "col2": ["cat", "dog", "kangoo"],
                   "col3": ["tv", "phone", "ps4"]})
ex2 = ex.iloc[0:2]
ex2.isin(ex).all().all()

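This returns True. Note that when isin is given a DataFrame, it compares values at matching index and column labels, so the check is position by position: it passes here because ex2 keeps the index labels of ex.

If you switch some values, such as tv and phone, you get False: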

ex2 = pd.DataFrame({"col1": ["banana", "tomato"],
                    "col2": ["cat", "dog"],
                    "col3": ["phone", "tv"]})
ex2.isin(ex).all().all()
>> False
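
One thing to watch with this approach is row order. The sketch below (ex3 is a made-up name) takes the same first two rows of ex, reverses them, and resets the index. The rows are still contained in ex, but isin compares label by label and reports a mismatch, while a merge-based check does not depend on order:

```python
import pandas as pd

ex = pd.DataFrame({"col1": ["banana", "tomato", "apple"],
                   "col2": ["cat", "dog", "kangoo"],
                   "col3": ["tv", "phone", "ps4"]})

# Same two rows as ex.iloc[0:2], but reversed and relabelled 0..1.
ex3 = ex.iloc[[1, 0]].reset_index(drop=True)

# isin with a DataFrame compares cells at the same index/column labels.
positional = bool(ex3.isin(ex).all().all())
# An inner merge matches rows wherever they sit in ex.
order_independent = len(ex3.merge(ex)) == len(ex3)

print(positional, order_independent)
```

So the isin check is best suited to cases where the two DataFrames are known to share index labels, for example when one is a slice of the other.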
J. Doe answered Sep 20 '22 12:09
