Desire:
I want a way to merge two data frames and keep only the rows from a specified data frame that are not in the other (the non-intersecting data).
Problem:
I have duplicate data, and I expected this line to remove it:
final_df = new_df[~new_df.isin(previous_df)].dropna()
Example data and test:
from collections import namedtuple
import pandas as pd

# One of the namedtuple record types; field names other than col01 are placeholders
Record = namedtuple('Record', ['col01', 'col02', 'col03', 'col04'])

record = Record(1000, 9300815, '<redacted type>', '<redacted id>')
test_df = pd.DataFrame([record])
if not final_df.empty:  # final_df comes from the line above
    # this produces an empty data frame
    empty_df = test_df[test_df.isin(final_df)].dropna()
    # this produces the record
    record_df = final_df[final_df.col01 == record.col01]
Background:
I'm loading XML data and converting the XML file into several different record types as namedtuples. I split each record type into its own dataframe. I then compare the current set of data from the XML file with the data already loaded into the database by constructing previous_df like so:
previous_df = pd.read_sql_table(table_name, con=conn, schema=schema, columns=columns)
The columns are created dynamically from the fields of the namedtuple. The database schema is generated with sqlalchemy, and I have added a UniqueConstraint to handle the cases where I expect duplicates within the database.
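For reference, here is a rough sketch of how one of these tables is declared (the column names are just placeholders, not the real ones):

from sqlalchemy import Column, Integer, MetaData, String, Table, UniqueConstraint

metadata = MetaData()
# One table per record type; the UniqueConstraint lists the fields that
# identify a duplicate record (placeholder column names)
record_table = Table(
    'record', metadata,
    Column('col01', Integer),
    Column('col02', Integer),
    Column('col03', String),
    Column('col04', String),
    UniqueConstraint('col01', 'col02', name='uq_record_col01_col02'),
)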
Thanks in advance for any help provided.
PRESERVING SINGLE RECORDS FROM BOTH DATAFRAMES:
Try to concat the dataframes first, so that you are sure you will have duplicates. Then apply drop_duplicates, and I think you will end up with what you are after. See the example below:
import pandas as pd

#Create dummy data
df1 = pd.DataFrame(columns=["A","B"],data=[[1,2],[3,4],[5,6]])
print(df1)
A B
0 1 2
1 3 4
2 5 6
df2 = pd.DataFrame(columns=["A","B"],data=[[3,4],[5,6],[7,8],[9,10]])
print(df2)
A B
0 3 4
1 5 6
2 7 8
3 9 10
#Concatenate dataframes
df = pd.concat([df1,df2],axis=0)
print(df)
A B
0 1 2
1 3 4
2 5 6
0 3 4
1 5 6
2 7 8
3 9 10
#Drop duplicates
df = df.drop_duplicates(keep=False)
print(df)
A B
0 1 2
2 7 8
3 9 10
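The keep=False argument is doing the real work here: it drops every occurrence of a duplicated row instead of keeping the first one. Using the same df1 and df2, compare it with the default behaviour:

#Concatenate again and compare the default keep='first' with keep=False
both = pd.concat([df1,df2],axis=0)
print(both.drop_duplicates())
A B
0 1 2
1 3 4
2 5 6
2 7 8
3 9 10
print(both.drop_duplicates(keep=False))
A B
0 1 2
2 7 8
3 9 10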
PRESERVING SINGLE RECORDS FROM ONE DATAFRAME ONLY:
If you only want to keep the records from the new dataframe, use a dirty little trick: concat the old dataframe twice, so that every old record falls under the drop_duplicates criteria. Like so:
#Concatenate dataframes with old dataframe taken twice!
df = pd.concat([df1,df1,df2],axis=0)
#Now you will only end up with the records from the second dataframe
df = df.drop_duplicates(keep=False)
print(df)
A B
2 7 8
3 9 10
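Applied to the dataframes from the question (assuming previous_df and new_df share the same columns, and that previous_df itself contains no internal duplicates), the whole comparison becomes a one-liner:

#Keep only the rows of new_df that are not already present in previous_df
final_df = pd.concat([previous_df, previous_df, new_df], axis=0).drop_duplicates(keep=False)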