 

Pandas, drop duplicates but merge certain columns

Tags: python, pandas
I'm looking for a way to drop duplicate rows based on a certain column subset, but merge some of the data so it does not get removed.

import pandas as pd

# Example Dataframe
data = {
    "Parcel": ['001', '002', '003', '003'],
    "Res"   : ['Henry', 'Nick', 'Paul', 'Bill'],
    "Bill"  : ['4,100', '2,300', '5,200', '4,000'],
    "Year"  : ['1995', '1990', '2008', '2008']
}
df = pd.DataFrame.from_dict(data=data)
print(df)

  Parcel    Res   Bill  Year
0    001  Henry  4,100  1995
1    002   Nick  2,300  1990
2    003   Paul  5,200  2008
3    003   Bill  4,000  2008

Some pseudo code would look something like this:

df = df.drop_duplicates(subset='Parcel', keep_data=['Res', 'Bill'])
  Parcel         Res          Bill  Year
0    001       Henry         4,100  1995
1    002        Nick         2,300  1990
2    003  Paul, Bill  5,200, 4,000  2008

I'm not sure where to begin with this, but any tips as to where to look would be appreciated.

Tyler Singleton asked Feb 01 '26

1 Answer

You can use .groupby with .agg:

df = (
    df.groupby("Parcel")
    .agg({"Res": ", ".join, "Bill": ", ".join, "Year": "first"})
    .reset_index()
)
print(df)

Prints:

  Parcel         Res          Bill  Year
0    001       Henry         4,100  1995
1    002        Nick         2,300  1990
2    003  Paul, Bill  5,200, 4,000  2008
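This works because the duplicated "003" rows share the same Year, so "first" loses nothing there, while ", ".join concatenates the string values in each group. As a sketch of an equivalent, arguably more readable spelling, pandas also supports named aggregation (the column names and logic below mirror the dict-based .agg above):

```python
import pandas as pd

data = {
    "Parcel": ['001', '002', '003', '003'],
    "Res":    ['Henry', 'Nick', 'Paul', 'Bill'],
    "Bill":   ['4,100', '2,300', '5,200', '4,000'],
    "Year":   ['1995', '1990', '2008', '2008'],
}
df = pd.DataFrame(data)

# Named aggregation: each output column is (input column, aggregation
# function). ", ".join works because the values are already strings.
merged = df.groupby("Parcel", as_index=False).agg(
    Res=("Res", ", ".join),
    Bill=("Bill", ", ".join),
    Year=("Year", "first"),
)
print(merged)
```

Note that if Bill held numbers rather than strings, you would need to convert them (e.g. `.astype(str)`) before joining.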

EDIT: If you have many columns, you can aggregate all of them with "first" and then update the DataFrame with the joined columns:

g = df.groupby("Parcel")
x = g.agg("first")
x.update(g.agg({"Res": ", ".join, "Bill": ", ".join}))
print(x.reset_index())

  Parcel         Res          Bill  Year
0    001       Henry         4,100  1995
1    002        Nick         2,300  1990
2    003  Paul, Bill  5,200, 4,000  2008
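The same idea can also be expressed in a single pass by building the aggregation dict programmatically. This is a minimal sketch (the `join_cols` list is an assumption standing in for whichever columns you want merged):

```python
import pandas as pd

data = {
    "Parcel": ['001', '002', '003', '003'],
    "Res":    ['Henry', 'Nick', 'Paul', 'Bill'],
    "Bill":   ['4,100', '2,300', '5,200', '4,000'],
    "Year":   ['1995', '1990', '2008', '2008'],
}
df = pd.DataFrame(data)

# Columns whose values should be joined; every other column keeps
# the first value seen in its group.
join_cols = ["Res", "Bill"]
agg_map = {c: (", ".join if c in join_cols else "first")
           for c in df.columns if c != "Parcel"}
out = df.groupby("Parcel", as_index=False).agg(agg_map)
print(out)
```

With many columns this avoids writing out the dict by hand, at the cost of assuming every non-joined column is constant within a Parcel group.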
Andrej Kesely answered Feb 03 '26