Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Automating Winmerge comparison in Python

I require an assistant with coding a comparison between two tables in python, that is currently done in winmerge.

The code is as follows

import pandas as pd

Last week's table

df1=pd.read_csv(r"C:\Users\ri0a\OneDrive - Department of Environment, Land, Water and Planning\Python practice\pvmodules+_210326.csv")

This week table with new model numbers, and expire dates

df2=pd.read_csv(r"C:\Users\ri0a\OneDrive - Department of Environment, Land, Water and Planning\Python practice\pvmodules+_210401.csv")

The table head is as below

enter image description here

the third column is PV_module certificate: Expiry date. I want to set a logic similar to excel logic '=IF (D2<DATEVALUE("19/04/2021"),"Expired","OK). The objective here is to delete the entire rows where the expiry date is below a specific date/ today's date.

Next,Importing dataframe_diff package

from dataframe_diff import dataframe_diff

Executing the difference

d1_column,d2_additional=dataframe_diff(df1,df2,key=['PV Module Certificate: Licensee/Certificate Holder Account','Model Number/s'])

With this package d2_additional shows if there are new rows associated with model numbers added this week compared to last week. enter image description here

However, I am trying to replicate the following output

enter image description here

The tasks involved are

  1. If some model, in this case a row, was included in the last week's table, but is missing in current week's table, I want to assign a new field "Expired" in a new column "Status" next to it./ Or create a new dataframe, d2_expires, from only those missing rows.
  2. Another dataframe, where the rows or the product models that were missing last week but added this week remains...As d2_additional.
  3. A third dataframe, where any changes (for example expiry date) for same rows (same cerificate + same model but different new expiry date) is captures as d3_comparison.

Please kindly help me with this.

Thanks in advance.

Now: as with

d2_expires = merged_df[merged_df._merge == 'left_only']

and with

d2_additional = merged_df[merged_df._merge == 'right_only']

I get the same output. Same rows are returned, which should not be the case. As seen from the screen below

Expires output

This is the same as additions d2_aDDITIONS IS THE SAME OUTPUT AS D2_EXPIRES

And Finally, I get an error with d2_comaprison.

d3_comparison = merged_df[merged_df._merge == 'both'].\
        loc[lambda x: x.PV Module Certificate: Expiry Date_last_week != x.PV Module Certificate: Expiry Date_this_week]

error

I am happy to share my data with you. Please email me at [email protected] for data sharing.

like image 906
Mihan Avatar asked Jul 07 '20 22:07

Mihan


1 Answers

You have to ensure to convert the dates into datetime format after loading the data, and rename the columns to something easier to work with (for example 'cert_holder', 'model_no','approval_date','expiry_date')

I want to set a logic similar to excel logic '=IF (D2<DATEVALUE("19/04/2021"),"Expired","OK). The objective here is to delete the entire rows where the expiry date is below a specific date/ today's date.

This (removing) can be done with:

df = df[df['expiry_date'] >= pd.Timestamp('today')]

# Or

df = df[df['expiry_date'] >= pd.Timestamp('2021-04-23')]

But this only works, if your expiry dates are in datetime format.

Next merge the two dataframes:

merged_df = pd.merge(df1,df2, how='outer', on=['cert_holder','model_no'],\
                     suffixes=['_last_week','_this_week'], indicator=True)

If some model, in this case a row, was included in the last week's table, but is missing in current week's table, I want to assign a new field "Expired" in a new column "Status" next to it./ Or create a new dataframe, d2_expires, from only those missing rows.

d2_expires = merged_df[merged_df._merge == 'left_only']

Another dataframe, where the rows or the product models that were missing last week but added this week remains...As d2_additional.

d2_additional = merged_df[merged_df._merge == 'right_only']

A third dataframe, where any changes (for example expiry date) for same rows (same certificate + same model but different new expiry date) is captures as d3_comparison.

d3_comparison = merged_df[merged_df._merge == 'both'].\
        loc[lambda x: x.expiry_date_last_week != x.expiry_date_this_week]
like image 63
Betelgeux Avatar answered Nov 04 '22 02:11

Betelgeux