
Comparing two DataFrames by one column with a return of three different outputs with Pandas

I am a beginner in Python and coding. I need help comparing two dataframes of different lengths whose column labels all differ except for one. That shared column is the one I want to compare the dataframes by. My data looks like this:

    df:  'fruits'  'trees'      'sports'    'countries'  

          bananas   mongolia     basketball    Spain
          grapes    Oak          rugby         Thailand
          oranges   Osage Orange baseball      Egypt
          apples    Maple        golf          Chile

    df2: 'cars'  'flowers'     'countries'    'vegetables'

          Audi    Rose          Spain           Carrots
          BMW     Tulip         Nigeria         Celery
          Honda   Dandelion     Egypt           Onion

I would like to compare these two dataframes based on the column 'countries' and create three separate outputs, each in its own dataframe. I have been using Pandas and have used pd.concat to combine df and df2 into one. I would also like to keep the rest of each row's columns even when the rows don't match.

Here are my desired outputs:

Output #1: Values in df NOT in df2:

    df3: 'fruits'  'trees'      'sports'    'countries'  

          grapes    Oak            rugby         Thailand
          apples    Maple          golf          Chile

Output #2: Values in df2 NOT in df:

        df4: 'cars'  'flowers'   'countries'    'vegetables'

              BMW     Tulip       Nigeria         Celery

Output #3: Values in both df AND df2 (with the columns from the two dataframes combined):

    df5: 'fruits'  'trees'       'sports'    'cars'  'flowers'   'countries'  'vegetables'

          bananas   mongolia      basketball  Audi    Rose        Spain        Carrots
          oranges   Osage Orange  baseball    Honda   Dandelion   Egypt        Onion

Hope this all makes sense. I have tried so many different things (isin, DataFrame.diff and .difference, df-df2, numpy arrays, etc.). I have looked all over and can't find exactly what I'm looking for. Any help would be greatly appreciated! Thank you!

J.L. asked Sep 08 '16 21:09



1 Answer

Setup Reference

from io import StringIO
import pandas as pd

# Sample data from the question, written as CSV text
txt1 = """fruits,trees,sports,countries
bananas,mongolia,basketball,Spain
grapes,Oak,rugby,Thailand
oranges,Osage Orange,baseball,Egypt
apples,Maple,golf,Chile"""

txt2 = """cars,flowers,countries,vegetables
Audi,Rose,Spain,Carrots
BMW,Tulip,Nigeria,Celery
Honda,Dandelion,Egypt,Onion"""

# Parse each CSV string into a DataFrame
df = pd.read_csv(StringIO(txt1))

df2 = pd.read_csv(StringIO(txt2))

Solution

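def outer_parts(df1, df2):
    # Outer merge on the shared column(s) (here 'countries'); indicator=True adds a
    # '_merge' column marking each row as 'left_only', 'right_only', or 'both'.
    df3 = df1.merge(df2, indicator=True, how='outer')
    # Split by that marker and drop the helper column from each part.
    return {n: g.drop(columns='_merge') for n, g in df3.groupby('_merge')}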


dfs = outer_parts(df, df2)
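
For readers who prefer to name the key column explicitly, here is a minimal self-contained sketch of the same outer-merge idea. The helper name `split_by_key` is hypothetical and not part of pandas, and the frames are rebuilt from the question's data so the block runs on its own. It filters on `_merge` instead of grouping, so a part with no rows comes back as an empty DataFrame rather than a missing dictionary key.

```python
import pandas as pd

# Data copied from the question
df = pd.DataFrame({
    "fruits": ["bananas", "grapes", "oranges", "apples"],
    "trees": ["mongolia", "Oak", "Osage Orange", "Maple"],
    "sports": ["basketball", "rugby", "baseball", "golf"],
    "countries": ["Spain", "Thailand", "Egypt", "Chile"],
})
df2 = pd.DataFrame({
    "cars": ["Audi", "BMW", "Honda"],
    "flowers": ["Rose", "Tulip", "Dandelion"],
    "countries": ["Spain", "Nigeria", "Egypt"],
    "vegetables": ["Carrots", "Celery", "Onion"],
})


def split_by_key(left, right, key):
    """Hypothetical helper: split an outer merge into its three parts."""
    merged = left.merge(right, on=key, how="outer", indicator=True)
    parts = {}
    for label in ("left_only", "right_only", "both"):
        # Keep the rows carrying this marker, then drop the helper column
        rows = merged.loc[merged["_merge"] == label]
        parts[label] = rows.drop(columns="_merge")
    return parts


parts = split_by_key(df, df2, "countries")

# Restrict the one-sided parts to their own columns; after an outer merge
# the other frame's columns are present but filled with NaN.
df3 = parts["left_only"][df.columns]     # in df, not in df2
df4 = parts["right_only"][df2.columns]   # in df2, not in df
df5 = parts["both"]                      # in both, columns combined

print(df3, df4, df5, sep="\n\n")
```

Selecting `df.columns` and `df2.columns` at the end is a design choice to match the desired outputs in the question; leave it out if the NaN-filled columns are acceptable.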

Demonstration

dfs['both']


dfs['left_only']


dfs['right_only']

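
Since the question mentions trying `isin`, here is a rough sketch of an alternative that is not the method used in this answer: `Series.isin` picks out the non-matching rows of each frame, and an inner merge builds the combined frame. It assumes the 'countries' values are unique within each frame; with duplicates, the inner merge would produce one row per matching pair.

```python
import pandas as pd

# Data copied from the question
df = pd.DataFrame({
    "fruits": ["bananas", "grapes", "oranges", "apples"],
    "trees": ["mongolia", "Oak", "Osage Orange", "Maple"],
    "sports": ["basketball", "rugby", "baseball", "golf"],
    "countries": ["Spain", "Thailand", "Egypt", "Chile"],
})
df2 = pd.DataFrame({
    "cars": ["Audi", "BMW", "Honda"],
    "flowers": ["Rose", "Tulip", "Dandelion"],
    "countries": ["Spain", "Nigeria", "Egypt"],
    "vegetables": ["Carrots", "Celery", "Onion"],
})

# Output 1: rows of df whose country does not appear in df2
df3 = df[~df["countries"].isin(df2["countries"])]

# Output 2: rows of df2 whose country does not appear in df
df4 = df2[~df2["countries"].isin(df["countries"])]

# Output 3: rows present in both, with the columns of both frames combined
df5 = df.merge(df2, on="countries", how="inner")

print(df3, df4, df5, sep="\n\n")
```

The boolean masks keep each frame's original row order and index, which the outer-merge approach does not guarantee.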

piRSquared answered Oct 14 '22 05:10