I've got a large set of data where I'm trying to remove duplicates based on 2 fields. Sample set:
WOE_ID ISO Locationname Language Placetype Parent_ID ID Username
2347578 US Maine ENG State 23424977 1 sampleuser
2444322 US Maine ENG Town 12588275 1 sampleuser
2444324 US Maine ENG Town 12588852 1 sampleuser
2444326 US Maine ENG POI 12589403 1 sampleuser
2444327 US Maine ENG Town 12587582 1 sampleuser
2444325 US Maine ENG Country 12589315 1 sampleuser
28744443 US Maine ENG Town 12590578 1 sampleuser
2444323 US Maine ENG Town 2374968 1 sampleuser
As these are all duplicate values of ID (1), I want to keep only the entry with the biggest Placetype (Country here, where Country > State > Town > POI). Is there a simple way to do this that I'm overlooking, or do I have to write a loop that compares all the entries? I'd rather not do that, because there are over 3 million entries in the total database and I will probably have to run it several times.
Thanks in advance!
I think you can use an ordered Categorical, sort the DataFrame by the Placetype column with sort_values, and then groupby ID aggregating with first:
print (df)
WOE_ID ISO Locationname Language Placetype Parent_ID ID Username
0 2347578 US Maine ENG State 23424977 1 sampleuser
1 2444322 US Maine ENG Town 12588275 1 sampleuser
2 2444324 US Maine ENG Town 12588852 1 sampleuser
3 2444326 US Maine ENG POI 12589403 2 sampleuser
4 2444327 US Maine ENG Town 12587582 2 sampleuser
5 2444325 US Maine ENG Country 12589315 3 sampleuser
6 28744443 US Maine ENG Town 12590578 3 sampleuser
7 2444323 US Maine ENG Town 2374968 3 sampleuser
import pandas as pd

# 'Country' is listed first, so it sorts to the top in ascending order
df.Placetype = pd.Categorical(df.Placetype,
                              categories=['Country','State','Town','POI'],
                              ordered=True)
df = df.sort_values('Placetype').groupby('ID', as_index=False).first()
print (df)
ID WOE_ID ISO Locationname Language Placetype Parent_ID Username
0 1 2347578 US Maine ENG State 23424977 sampleuser
1 2 2444327 US Maine ENG Town 12587582 sampleuser
2 3 2444325 US Maine ENG Country 12589315 sampleuser
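If you would rather keep the original column order (groupby('ID', as_index=False).first() moves ID to the front, as in the output above), a sort_values plus drop_duplicates variant should give the same rows. This is only a minimal, self-contained sketch; the DataFrame construction below just mirrors the sample data and is illustrative:

import pandas as pd

# hypothetical reconstruction of the sample frame shown above
df = pd.DataFrame({
    'WOE_ID': [2347578, 2444322, 2444324, 2444326, 2444327, 2444325, 28744443, 2444323],
    'ISO': ['US'] * 8,
    'Locationname': ['Maine'] * 8,
    'Language': ['ENG'] * 8,
    'Placetype': ['State', 'Town', 'Town', 'POI', 'Town', 'Country', 'Town', 'Town'],
    'Parent_ID': [23424977, 12588275, 12588852, 12589403, 12587582, 12589315, 12590578, 2374968],
    'ID': [1, 1, 1, 2, 2, 3, 3, 3],
    'Username': ['sampleuser'] * 8})

# ordered categorical: 'Country' ranks first, so it sorts to the top ascending
df.Placetype = pd.Categorical(df.Placetype,
                              categories=['Country', 'State', 'Town', 'POI'],
                              ordered=True)

# keep the first (highest-ranked) row per ID; columns stay in their original order
out = df.sort_values('Placetype').drop_duplicates('ID').sort_values('ID')
print(out)

drop_duplicates keeps the first occurrence of each ID, which after the sort is the row with the highest-ranked Placetype, and it avoids the column reshuffle that the groupby version produces.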