Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to fix a Key error when trying to get an Excel diff with Python

I'm trying to compare two excel documents and get the difference in a new doc. I found a tutorial, read it and tried out the exact code from it, with sample docs from the author's GitHub (clicky: https://pbpython.com/excel-diff-pandas-update.html). It works, but when I try to change values inside the spreadsheets to random ones (I will need less columns so I was fiddling around with the original file), I keep getting an error:

KeyError: 'passes columns are not ALL present dataframe'

I keep all the headers, and I'm getting the error no matter if I leave some cells blank or if I fill them with text. What am I doing wrong?

I failed to find an exact same error which is why I'm posting this question.

This works fine:


import pandas as pd


# Define the diff function to show the changes in each field
def report_diff(x):
    return x[0] if x[0] == x[1] else '{} ---> {}'.format(*x)

# Read in the two files but call the data old and new and create columns to track
old = pd.read_excel('C:\\Users\\Olga\\Documents\\sample-address-1.xlsx', 'Sheet1', na_values=['NA'])
new = pd.read_excel('C:\\Users\\Olga\\Documents\\sample-address-2.xlsx', 'Sheet1', na_values=['NA'])
old['version'] = "old"
new['version'] = "new"

old_accts_all = set(old['account number'])
new_accts_all = set(new['account number'])

dropped_accts = old_accts_all - new_accts_all
added_accts = new_accts_all - old_accts_all

all_data = pd.concat([old,new],ignore_index=True)
changes = all_data.drop_duplicates(subset=["account number",
                                           "name", "street",
                                           "city","state",
                                           "postal code"], keep='last')

dupe_accts = changes[changes['account number'].duplicated() == True]['account number'].tolist()
dupes = changes[changes["account number"].isin(dupe_accts)]

# Pull out the old and new data into separate dataframes
change_new = dupes[(dupes["version"] == "new")]
change_old = dupes[(dupes["version"] == "old")]

# Drop the temp columns - we don't need them now
change_new = change_new.drop(['version'], axis=1)
change_old = change_old.drop(['version'], axis=1)

# Index on the account numbers
change_new.set_index('account number', inplace=True)
change_old.set_index('account number', inplace=True)

# Combine all the changes together
df_all_changes = pd.concat([change_old, change_new],
                            axis='columns',
                            keys=['old', 'new'],
                            join='outer')

# Define the diff function to show the changes in each field
def report_diff(x):
    return x[0] if x[0] == x[1] else '{} ---> {}'.format(*x)

df_all_changes = df_all_changes.swaplevel(axis='columns')[change_new.columns[0:]]

df_changed = df_all_changes.groupby(level=0, axis=1).apply(lambda frame: frame.apply(report_diff, axis=1))
df_changed = df_changed.reset_index()

df_removed = changes[changes["account number"].isin(dropped_accts)]
df_added = changes[changes["account number"].isin(added_accts)]

output_columns = ["account number", "name", "street", "city", "state", "postal code"]
writer = pd.ExcelWriter("my-diff.xlsx")
df_changed.to_excel(writer,"changed", index=False, columns=output_columns)
df_removed.to_excel(writer,"removed",index=False, columns=output_columns)
df_added.to_excel(writer,"added",index=False, columns=output_columns)
writer.save()

But then I change the values and:


Traceback (most recent call last):
  File "C:\Users\Olga\Documents\exceldiff.py", line 61, in <module>
    df_changed.to_excel(writer,"changed", index=False, columns=output_columns)
  File "C:\Python37\lib\site-packages\pandas\core\generic.py", line 2248, in to_excel
    inf_rep=inf_rep,
  File "C:\Python37\lib\site-packages\pandas\io\formats\excel.py", line 392, in __init__
    raise KeyError("passes columns are not ALL present dataframe")
KeyError: 'passes columns are not ALL present dataframe'

like image 311
Enid Avatar asked Mar 31 '26 09:03

Enid


1 Answers

Sorry to resurrect this, but I just got this error and found little documentation on it. This error should be phrased differently. 'passes columns are not ALL present dataframe' does not make sense in English. What it means is "passed columns are not all present in dataframe". That is, you are trying to pass a column or columns which do not exist in your dataframe. Check for typos or trailing/leading spaces in your column names.

like image 134
edutt Avatar answered Apr 02 '26 22:04

edutt