I have two CSV files and both files having multiple columns & rows, i'm looking forward to get the difference across all the rows of both the files. Lets suppose if there is difference in Asset Tag Number
between the files then highlight the differences in any form(may be Bold the values or something appropriate), Moreover, we have a Key here is Serial Number
which is unique on both the files. So, Would be good to take the difference of the rows into a new.csv file and highlight the differences while removing the identical rows.
Just for the refe, i have Both the Files having more than 100 columns..
My actual Data columns are like below on both the csv files.
Columns: [Asset Tag Number_a, Serial Number_a, System Name_a, Domain_a, System manufacturer_a, Model Name_a, System Type_a, Critical Level_a, Purpose Level 1_a, Purpose2_a, ShareIndv_a, Site_a, Building_a, Room_a, Rack_a, serverCostCenter_a, User ID BU Grp Mgr_a, OS Name_a, OS Version_a, OS Type_a, Service Pack_a, Notification Group_a, Off The Network_a, First Name_a, Last Name_a, Manager Name_a, Status_a, BU Cost Center_a, BU CC Description_a, Organization Name_a, Higher Level BU_a, Business Contact_a, Description_a, Asset Type_a, System Type SW_a, Server _a, Host ID(Unix)_a, IP Address_a, MAC Address_a, Installed RAM_a, Disk Capacity_a, Installed Disk_a, Server Status _a, High Level Status_a, Lifecycle Status_a, EndOfLifeDate_a, Last Audit_a, AltVersion_a, BIOS Vendor_a, BIOS Version_a, BIOS Release Date_a, SMBIOS Enabled_a, SMBios Version_a, Region_a, Currency_a, Acquisition Cost USD_a, Net Book Value USD_a, CPU Type_a, CPU Speed_a, Acquisition Date_a, Age_a, DateModified_a, Altiris Exception_a, Inventory Owner_a, Last Logon User_a, Inventory Owner Last Logon User_a, Client Date_a, Reporting Status_a, Contact Status_a, Comments_a, Exception Reason_a, DNR_a, Asset Tag Number_b, Serial Number_b, System Name_b, Domain_b, System manufacturer_b, Model Name_b, System Type_b, Critical Level_b, Purpose Level 1_b, Purpose2_b, ShareIndv_b, Site_b, Building_b, Room_b, Rack_b, serverCostCenter_b, User ID BU Grp Mgr_b, OS Name_b, OS Version_b, OS Type_b, Service Pack_b, Notification Group_b, Off The Network_b, First Name_b, Last Name_b, Manager Name_b, Status_b, BU Cost Center_b, ...]
Index: []
As being a newbie pandas learner i applied fews methods of code but doesn't seems to be a nearby fit hence seeking a generous help & suggestions..
1) First code tried..
#!/grid/common/pkgs/python/v3.6.1/bin/python3
import pandas as pd
A = pd.read_csv('a.csv', index_col=0)
B = pd.read_csv('b.csv', index_col=0)
C = pd.merge(left=A,right=B, how='outer', left_index=True, right_index=True, suffixes=['_a', '_b'])
not_in_a = C.drop( A.index )
not_in_b = C.drop( B.index )
not_in_a.to_csv('not_in_a.csv')
not_in_b.to_csv('not_in_b.csv')
2) tried another code but output is so much width which tough to read, whereas this snippet should drop the duplicates and will print only the one who are in difference..
from __future__ import print_function
from signal import signal, SIGPIPE, SIG_DFL
signal(SIGPIPE,SIG_DFL)
import csv
import pandas as pd
##### Python pandas, widen output display to see more columns. ####
pd.set_option('display.height', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('expand_frame_repr', True)
a = pd.read_csv('a.csv')
b = pd.read_csv('b.csv')
c = pd.concat([a,b], axis=0)
c.drop_duplicates(keep='first', inplace=True)
c.reset_index(drop=True, inplace=True)
print(c)
I did some google searching and found some stack overflow discussion on the topic. However, there are some decent solutions in the thread but nothing that I felt would meet my requirements hence i posted here.
3) another code applied with python sets which works partially ..
#!/grid/common/pkgs/python/v3.6.1/bin/python3
import os
orig = open('aa.csv','r')
new = open('bb.csv','r')
bigb = set(new) - set(orig)
print(bigb)
# Write to output file
with open('different.csv', 'w') as file_out:
for line in bigb:
file_out.write(line)
orig.close()
new.close()
file_out.close()
I have below two sample file for reference which looks similar to my data,where we can take the Serial Number
as a key for out logic and code.
Below are my two csv files file1.csv & file2.csv
File1:
wrkStaId Asset Tag Number Serial Number System Name
mac-ymatsuok2
PC-ABNER-W10
PC-ADAMLIN-W10
{ED0CCFFD-28D6-4170-9DE9-0DFB83F49193} 1234 ser123 sfreder
{8AEAF485-A4FF-460C-91FA-0DFCAD79DD24} 3456 ser124 10210277
{E6204B69-DABB-4A1E-906B-0DFD2BCEDA41} 456 ser345 A313819
{445EC096-A70C-47D1-91FF-0DFE747F762A} 4485 ser900 dgs1sj
Sample File2:
wrkStaId Asset Tag Number Serial Number System Name
mac-ymatsuok2
PC-Karn-W10
PC-ADAMLIN-W10
PC-ADRIANA-W10
{ED0CCFFD-28D6-4170-9DE9-0DFB83F49193} 1234 ser123 sfreder
{8AEAF485-A4FF-460C-91FA-0DFCAD79DD24} 3456 ser124 10210277
{E6204B69-DABB-4A1E-906B-0DFD2BCEDA41} 1709 ser345 A313819
{445EC096-A70C-47D1-91FF-0DFE747F762A} 4485 ser900 dgs1sj
Desired Result: How do you want the difference represented, as these are non-numeric values. Do you want to print both rows in case they differ into a new file, and drop them if they are the same?
ANS: Yes
Desired output..
Differce in File1 which is notin file2
wrkStaId Asset Tag Number Serial Number System Name
PC-ABNER-W10
{E6204B69-DABB-4A1E-906B-0DFD2BCEDA41} 456 ser345 A313819
Differce in File2 which is not in file1
wrkStaId Asset Tag Number Serial Number System Name
PC-Karn-W10
PC-ADRIANA-W10
{E6204B69-DABB-4A1E-906B-0DFD2BCEDA41} 1709 ser345 A313819
Much thanks to the @w-m, however i'm still hopeful to spill the some more ideas from the experts from SO.
During data analysis, one might need to compute the difference between two rows for comparison purposes. This can be done using pandas. DataFrame. diff() function.
The equals() function is used to test whether two Pandas objects contain the same elements.
In Pandas, we have the freedom to add columns in the data frame whenever needed. There are multiple ways to add columns to pandas dataframe.
Your data seems to contain two parts: a list of System Name
s, and then a table of rows. As the structure is quite different, I suggest you split the data into a list of System Name
s and the full rows and process them separately.
First extract the System Name
lists:
l1 = df1[df1.wrkStaId == ""].System_Name
l2 = df2[df2.wrkStaId == ""].System_Name
You can get the difference with Python set difference code:
>>> set(l1).difference(set(l2))
{'PC-ABNER-W10'}
>>> set(l2).difference(set(l1))
{'PC-ADRIANA-W10', 'PC-Karn-W10'}
Now drop the empty wrkStaId entries:
df1 = df1[df1.wrkStaId != ""].set_index("wrkStaId")
df2 = df2[df1.wrkStaId != ""].set_index("wrkStaId")
The rest of the data now contains full rows with wrkStaId
as the index.
df1:
Asset_Tag_Number Serial_Number System_Name
wrkStaId
{ED0CCFFD-28D6-4170-9DE9-0DFB83F49193} 1234.0 ser123 sfreder
{8AEAF485-A4FF-460C-91FA-0DFCAD79DD24} 3456.0 ser124 10210277
{E6204B69-DABB-4A1E-906B-0DFD2BCEDA41} 456.0 ser345 A313819
{445EC096-A70C-47D1-91FF-0DFE747F762A} 4485.0 ser900 dgs1sj
df2:
Asset_Tag_Number Serial_Number System_Name
wrkStaId
{ED0CCFFD-28D6-4170-9DE9-0DFB83F49193} 1234.0 ser123 sfreder
{8AEAF485-A4FF-460C-91FA-0DFCAD79DD24} 3456.0 ser124 10210277
{E6204B69-DABB-4A1E-906B-0DFD2BCEDA41} 1709.0 ser345 A313819
{445EC096-A70C-47D1-91FF-0DFE747F762A} 4485.0 ser900 dgs1sj
You can now do the set differences on the pandas df's like this:
>>> df1[~df1.isin(df2).all(1)]
Asset_Tag_Number Serial_Number System_Name
wrkStaId
{E6204B69-DABB-4A1E-906B-0DFD2BCEDA41} 456.0 ser345 A313819
>>> df2[~df2.isin(df1).all(1)]
Asset_Tag_Number Serial_Number System_Name
wrkStaId
{E6204B69-DABB-4A1E-906B-0DFD2BCEDA41} 1709.0 ser345 A313819
You may need to adapt the code a little to get exactly what you want, but I hope this gets you going.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With