This seems simple, but I cannot find any information on it anywhere online.
I have a dataframe like below:

City     State  Zip         Date        Description
Earlham  IA     50072-1036  2014-10-10  Postmarket Assurance: Devices
Earlham  IA     50072-1036  2014-10-10  Compliance: Devices
Madrid   IA     50156-1748  2014-09-10  Drug Quality Assurance

How can I eliminate duplicates that match on 4 of the 5 columns, the non-matching column being Description?
The result would be:

City     State  Zip         Date        Description
Earlham  IA     50072-1036  2014-10-10  Postmarket Assurance: Devices
Madrid   IA     50156-1748  2014-09-10  Drug Quality Assurance

I found online that drop_duplicates with the subset parameter could work, but I am unsure how to apply it to multiple columns.
To drop duplicate columns from a pandas DataFrame, use df.T.drop_duplicates().T; this removes all columns that contain the same data, regardless of column names.
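As a quick sketch of that transpose trick (the frame here is hypothetical, not the question's data):

```python
import pandas as pd

# Hypothetical frame where column "b" holds the same data as column "a".
df = pd.DataFrame({"a": [1, 2], "b": [1, 2], "c": [3, 4]})

# Transpose so columns become rows, drop duplicate rows, transpose back.
deduped_cols = df.T.drop_duplicates().T
print(list(deduped_cols.columns))  # ['a', 'c']
```

Note that this compares the column *values*, so "b" is dropped even though its name differs from "a".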
To remove duplicates over only one or a subset of columns, pass subset as the individual column name or a list of the columns that should be unique. To make this conditional on another column's value, sort with sort_values(colname) first and pass keep='first' or keep='last'.
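A minimal sketch of the sort-then-keep pattern, using a made-up frame where we keep the largest reading per id:

```python
import pandas as pd

# Hypothetical data: multiple readings per id; we want the largest one.
df = pd.DataFrame({
    "id": [1, 1, 2],
    "reading": [10, 30, 20],
})

# Sort so the row to keep is last within each duplicate group,
# then keep='last' retains that row when dropping duplicates on "id".
latest = df.sort_values("reading").drop_duplicates(subset="id", keep="last")
```

Here keep='last' retains the row with reading 30 for id 1, rather than the default first occurrence.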
From the pandas docs: drop_duplicates returns a DataFrame with duplicate rows removed; considering only certain columns is optional, and indexes (including time indexes) are ignored.
You've actually found the solution. For multiple columns, subset will be a list.
df.drop_duplicates(subset=['City', 'State', 'Zip', 'Date'])

Or, just by stating the column to be ignored:

df.drop_duplicates(subset=df.columns.difference(['Description']))
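Putting it together on the question's data, both forms give the same two-row result (the second Earlham row, differing only in Description, is dropped):

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    "City": ["Earlham", "Earlham", "Madrid"],
    "State": ["IA", "IA", "IA"],
    "Zip": ["50072-1036", "50072-1036", "50156-1748"],
    "Date": ["2014-10-10", "2014-10-10", "2014-09-10"],
    "Description": ["Postmarket Assurance: Devices",
                    "Compliance: Devices",
                    "Drug Quality Assurance"],
})

# Keep the first row of each (City, State, Zip, Date) group.
deduped = df.drop_duplicates(subset=["City", "State", "Zip", "Date"])

# Equivalent: dedupe on every column except Description.
deduped2 = df.drop_duplicates(subset=df.columns.difference(["Description"]))

print(deduped)
```

By default keep='first', so "Postmarket Assurance: Devices" survives and "Compliance: Devices" is dropped; pass keep='last' to flip that, or keep=False to drop both.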