I have more rows in the resultant left join than are in the left dataframe.
# Importing Pandas and changing it's call to pd
import numpy as np
import pandas as pd
SalesDF = pd.read_csv(r"C:\Users\USER\Documents\Reports\SalesForAnalysis.csv")
print("This is the Sales shape")
print(SalesDF.shape)
CustInfoDF = pd.read_csv(r"C:\Users\USER\Documents\Cust.csv")
# This reassigns the df so that the rows with a NaN in the Account Number it doesn't appear
CustInfoDF = CustInfoDF[CustInfoDF['Account Number'].notna()]
# Merges the two dataframes on SalesDF with "Cust Number" as the key
MergeDF = pd.merge(SalesDF, CustInfoDF, how="left", left_on="Cust Number", right_on="Account Number")
print("This is the Merge Shape ")
print(MergeDF.shape)
# Reduced the number of columns to the selected columns
CutDF = MergeDF[["Customer", "Invoice #", "E-mail Address", "Phone", "Clerk", "Total", "Date"]]
CutDF.drop_duplicates()
print("This is the Cut shape ")
print(CutDF.shape)
Here is the result after running the program
This is the Sales shape
(5347, 61)
This is the Merge Shape
(6428, 83)
This is the Cut shape
(6428, 7)
Process finished with exit code 0
The CutDF should only have a max of 5347 rows. I have a drop_duplicates method in there but I still get the same results.
I saw this pandas left join - why more results? and inner join/merge in pandas dataframe give more rows than left dataframe but I didn't really see a solution to this in these though.
Any help would be appreciated.
Before you execute:
MergeDF = pd.merge(SalesDF, CustInfoDF, how="left", left_on="Cust Number", right_on="Account Number")
Can you do:
CustInfoDF = CustInfoDF.drop_duplicates(subset=["Account Number"])
I have a suspicion that your CustInforDF has multiple entries for each Account Number.
If this doesn't work, can you please post the sample data frames? Feel free to add/replace with dummy values as long as the code is replicable.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With