Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How python can get difference between all pairs of rows under multiple columns

I have two CSV files and both files having multiple columns & rows, i'm looking forward to get the difference across all the rows of both the files. Lets suppose if there is difference in Asset Tag Number between the files then highlight the differences in any form(may be Bold the values or something appropriate), Moreover, we have a Key here is Serial Number which is unique on both the files. So, Would be good to take the difference of the rows into a new.csv file and highlight the differences while removing the identical rows.

Just for the refe, i have Both the Files having more than 100 columns..

My actual Data columns are like below on both the csv files.

Columns: [Asset Tag Number_a, Serial Number_a, System Name_a, Domain_a, System manufacturer_a, Model Name_a, System Type_a, Critical Level_a, Purpose Level 1_a, Purpose2_a, ShareIndv_a, Site_a, Building_a, Room_a, Rack_a, serverCostCenter_a, User ID   BU Grp Mgr_a, OS Name_a, OS Version_a, OS Type_a, Service Pack_a, Notification Group_a, Off The Network_a, First Name_a, Last Name_a, Manager Name_a, Status_a, BU Cost Center_a, BU CC Description_a, Organization Name_a, Higher Level BU_a, Business Contact_a, Description_a, Asset Type_a, System Type SW_a, Server _a, Host ID(Unix)_a, IP Address_a, MAC Address_a, Installed RAM_a, Disk Capacity_a, Installed Disk_a, Server Status _a, High Level Status_a, Lifecycle Status_a, EndOfLifeDate_a, Last Audit_a, AltVersion_a, BIOS Vendor_a, BIOS Version_a, BIOS Release Date_a, SMBIOS Enabled_a, SMBios Version_a, Region_a, Currency_a, Acquisition Cost USD_a, Net Book Value USD_a, CPU Type_a, CPU Speed_a, Acquisition Date_a, Age_a, DateModified_a, Altiris Exception_a, Inventory Owner_a, Last Logon User_a, Inventory Owner Last Logon User_a, Client Date_a, Reporting Status_a, Contact Status_a, Comments_a, Exception Reason_a, DNR_a, Asset Tag Number_b, Serial Number_b, System Name_b, Domain_b, System manufacturer_b, Model Name_b, System Type_b, Critical Level_b, Purpose Level 1_b, Purpose2_b, ShareIndv_b, Site_b, Building_b, Room_b, Rack_b, serverCostCenter_b, User ID   BU Grp Mgr_b, OS Name_b, OS Version_b, OS Type_b, Service Pack_b, Notification Group_b, Off The Network_b, First Name_b, Last Name_b, Manager Name_b, Status_b, BU Cost Center_b, ...]
Index: []

As being a newbie pandas learner i applied fews methods of code but doesn't seems to be a nearby fit hence seeking a generous help & suggestions..

1) First code tried..

#!/grid/common/pkgs/python/v3.6.1/bin/python3
import pandas as pd

A = pd.read_csv('a.csv', index_col=0)
B = pd.read_csv('b.csv', index_col=0)

C = pd.merge(left=A,right=B, how='outer', left_index=True, right_index=True, suffixes=['_a', '_b'])

not_in_a = C.drop( A.index )
not_in_b = C.drop( B.index )

not_in_a.to_csv('not_in_a.csv')
not_in_b.to_csv('not_in_b.csv')

2) tried another code but output is so much width which tough to read, whereas this snippet should drop the duplicates and will print only the one who are in difference..

from __future__ import print_function
from signal import signal, SIGPIPE, SIG_DFL
signal(SIGPIPE,SIG_DFL)
import csv
import pandas as pd


##### Python pandas, widen output display to see more columns. ####
pd.set_option('display.height', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('expand_frame_repr', True)

a = pd.read_csv('a.csv')
b = pd.read_csv('b.csv')
c = pd.concat([a,b], axis=0)

c.drop_duplicates(keep='first', inplace=True)
c.reset_index(drop=True, inplace=True)
print(c)

I did some google searching and found some stack overflow discussion on the topic. However, there are some decent solutions in the thread but nothing that I felt would meet my requirements hence i posted here.

3) another code applied with python sets which works partially ..

#!/grid/common/pkgs/python/v3.6.1/bin/python3
import os
orig = open('aa.csv','r')
new = open('bb.csv','r')
bigb = set(new) - set(orig)
print(bigb)
# Write to output file
with open('different.csv', 'w') as file_out:
    for line in bigb:
        file_out.write(line)
    orig.close()
    new.close()
    file_out.close()

I have below two sample file for reference which looks similar to my data,where we can take the Serial Number as a key for out logic and code.

Below are my two csv files file1.csv & file2.csv

File1:

wrkStaId                     Asset Tag Number  Serial Number System Name
                                                              mac-ymatsuok2
                                                              PC-ABNER-W10
                                                              PC-ADAMLIN-W10                                                                                              
{ED0CCFFD-28D6-4170-9DE9-0DFB83F49193}    1234     ser123         sfreder
{8AEAF485-A4FF-460C-91FA-0DFCAD79DD24}    3456     ser124         10210277      
{E6204B69-DABB-4A1E-906B-0DFD2BCEDA41}    456      ser345         A313819
{445EC096-A70C-47D1-91FF-0DFE747F762A}    4485     ser900         dgs1sj

Sample File2:

    wrkStaId                Asset Tag Number Serial Number  System Name
                                                            mac-ymatsuok2
                                                            PC-Karn-W10
                                                            PC-ADAMLIN-W10
                                                            PC-ADRIANA-W10
   {ED0CCFFD-28D6-4170-9DE9-0DFB83F49193}   1234 ser123     sfreder
   {8AEAF485-A4FF-460C-91FA-0DFCAD79DD24}   3456 ser124     10210277
   {E6204B69-DABB-4A1E-906B-0DFD2BCEDA41}   1709 ser345     A313819
   {445EC096-A70C-47D1-91FF-0DFE747F762A}   4485 ser900     dgs1sj

Desired Result: How do you want the difference represented, as these are non-numeric values. Do you want to print both rows in case they differ into a new file, and drop them if they are the same?

ANS: Yes

Desired output..

Differce in File1 which is notin file2

wrkStaId                     Asset Tag Number  Serial Number System Name
                                                              PC-ABNER-W10                                                                                                
{E6204B69-DABB-4A1E-906B-0DFD2BCEDA41}    456      ser345         A313819

Differce in File2 which is not in file1

    wrkStaId                Asset Tag Number Serial Number  System Name
                                                            PC-Karn-W10
                                                            PC-ADRIANA-W10
   {E6204B69-DABB-4A1E-906B-0DFD2BCEDA41}   1709 ser345     A313819

Much thanks to the @w-m, however i'm still hopeful to spill the some more ideas from the experts from SO.

like image 882
Karn Kumar Avatar asked Aug 01 '18 04:08

Karn Kumar


People also ask

How do you find the difference between rows in Python?

During data analysis, one might need to compute the difference between two rows for comparison purposes. This can be done using pandas. DataFrame. diff() function.

How do I check if two rows have the same value in pandas?

The equals() function is used to test whether two Pandas objects contain the same elements.

Can pandas series have multiple columns?

In Pandas, we have the freedom to add columns in the data frame whenever needed. There are multiple ways to add columns to pandas dataframe.


1 Answers

Your data seems to contain two parts: a list of System Names, and then a table of rows. As the structure is quite different, I suggest you split the data into a list of System Names and the full rows and process them separately.

First extract the System Name lists:

l1 = df1[df1.wrkStaId == ""].System_Name
l2 = df2[df2.wrkStaId == ""].System_Name

You can get the difference with Python set difference code:

>>> set(l1).difference(set(l2))
{'PC-ABNER-W10'}
>>> set(l2).difference(set(l1))
{'PC-ADRIANA-W10', 'PC-Karn-W10'}

Now drop the empty wrkStaId entries:

df1 = df1[df1.wrkStaId != ""].set_index("wrkStaId")
df2 = df2[df1.wrkStaId != ""].set_index("wrkStaId")

The rest of the data now contains full rows with wrkStaId as the index.

df1:

                                        Asset_Tag_Number Serial_Number System_Name
wrkStaId                                                                          
{ED0CCFFD-28D6-4170-9DE9-0DFB83F49193}            1234.0        ser123     sfreder
{8AEAF485-A4FF-460C-91FA-0DFCAD79DD24}            3456.0        ser124    10210277
{E6204B69-DABB-4A1E-906B-0DFD2BCEDA41}             456.0        ser345     A313819
{445EC096-A70C-47D1-91FF-0DFE747F762A}            4485.0        ser900      dgs1sj

df2:

                                        Asset_Tag_Number Serial_Number System_Name
wrkStaId                                                                          
{ED0CCFFD-28D6-4170-9DE9-0DFB83F49193}            1234.0        ser123     sfreder
{8AEAF485-A4FF-460C-91FA-0DFCAD79DD24}            3456.0        ser124    10210277
{E6204B69-DABB-4A1E-906B-0DFD2BCEDA41}            1709.0        ser345     A313819
{445EC096-A70C-47D1-91FF-0DFE747F762A}            4485.0        ser900      dgs1sj

You can now do the set differences on the pandas df's like this:

>>> df1[~df1.isin(df2).all(1)]
                                        Asset_Tag_Number Serial_Number System_Name
wrkStaId                                                                          
{E6204B69-DABB-4A1E-906B-0DFD2BCEDA41}             456.0        ser345     A313819

>>> df2[~df2.isin(df1).all(1)]
                                            Asset_Tag_Number Serial_Number System_Name
wrkStaId                                                                          
{E6204B69-DABB-4A1E-906B-0DFD2BCEDA41}            1709.0        ser345     A313819

You may need to adapt the code a little to get exactly what you want, but I hope this gets you going.

like image 188
w-m Avatar answered Nov 15 '22 06:11

w-m