I have a data frame with several columns, one of which is company_name. I'm trying to remove duplicate records that share the same company_name, but I'm at a loss on how to do this while maintaining the original case of the company_name.
I've tried converting all names to lowercase and then removing duplicates, but as mentioned before I would like to maintain the original case.
df = pd.DataFrame({'company_name': ['Apple', 'apple', 'BlackBerry', 'blackberry', 'Blackberry']})
df['company_name'] = df['company_name'].str.strip()
df['company_name'] = df['company_name'].str.lower()
df.drop_duplicates(subset=['company_name'], inplace=True,
                   keep="first")
This code gets rid of the duplicates, but stores the company name in all lowercase.
Desired result
company_name
Apple
BlackBerry
Actual result
company_name
apple
blackberry
You could create a new column where you do your operations. This will allow you to keep the company name column untouched.
Here is what it could look like:
df = pd.DataFrame({'company_name':['Apple','apple','Blackberry','blackberry']})
company_name
0 Apple
1 apple
2 Blackberry
3 blackberry
df['company_name_2'] = df['company_name'].str.strip()
df['company_name_2'] = df['company_name_2'].str.lower()
df.drop_duplicates(subset=['company_name_2'], inplace=True,
                   keep="first")
df.drop(columns='company_name_2', inplace=True)
df.reset_index(inplace=True, drop=True)
Out :
company_name
0 Apple
1 Blackberry
How about using duplicated? It builds a boolean mask from the normalized names, so the original column is never modified:
yourdf=df[~df.company_name.str.strip().str.lower().duplicated(keep='first')]
yourdf
Out[415]:
company_name
0 Apple
2 BlackBerry
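For reference, the same idea can be wrapped in a small helper and run against the data from the question. This is only a sketch: the function name drop_case_insensitive_duplicates is made up for illustration, and it swaps str.lower() for str.casefold(), which is a slightly more aggressive normalization and is an assumption on my part rather than something either answer requires.

```python
import pandas as pd


def drop_case_insensitive_duplicates(frame, column):
    """Hypothetical helper: drop rows whose `column` matches an earlier row
    once whitespace and case are ignored. The original values are kept."""
    # Build a normalized key as a separate Series; the frame is not modified.
    key = frame[column].str.strip().str.casefold()
    # Keep the first row seen for each normalized key.
    return frame[~key.duplicated(keep="first")].reset_index(drop=True)


df = pd.DataFrame(
    {"company_name": ["Apple", "apple", "BlackBerry", "blackberry", "Blackberry"]}
)
result = drop_case_insensitive_duplicates(df, "company_name")
print(result)
```

Because keep="first" is used, the casing that survives is whichever spelling appears first in the frame, so sort the rows beforehand if a particular spelling should win.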