I want to remove duplicate rows from the dataframe based on values in two columns: Column1
and Column2
If dataframe
is:
df = pd.DataFrame({'Column1': ["'cat'", "'toy'", "'cat'"],
'Column2': ["'bat'", "'flower'", "'bat'"],
'Column3': ["'xyz'", "'abc'", "'lmn'"]})
On using:
result_df = df.drop_duplicates(subset=['Column1', 'Column2'], keep='first')
print(result_df)
I get:
Column1 Column2 Column3
0 'cat' 'bat' 'xyz'
1 'toy' 'flower' 'abc'
But using same code for dataframe
(Cat and Bat cases changed)
df = pd.DataFrame({'Column1': ["'Cat'", "'toy'", "'cat'"],
'Column2': ["'Bat'", "'flower'", "'bat'"],
'Column3': ["'xyz'", "'abc'", "'lmn'"]})
I get:
Column1 Column2 Column3
0 'Cat' 'Bat' 'xyz'
1 'toy' 'flower' 'abc'
2 'cat' 'bat' 'lmn'
Expected Output:
Column1 Column2 Column3
0 'cat' 'bat' 'xyz'
1 'toy' 'flower' 'abc'
How can this comparison be done case insensitively?
I figured it out. Create new uppercase columns and then use them to remove the duplicates. Once done, drop the uppercase columns.
df = pd.DataFrame({'Column1': ["'Cat'", "'toy'", "'cat'"],
'Column2': ["'Bat'", "'flower'", "'bat'"],
'Column3': ["'xyz'", "'abc'", "'lmn'"]})
df['Column1_Upper'] = df['Column1'].astype(str).str.upper()
df['Column2_Upper'] = df['Column2'].astype(str).str.upper()
This gives:
+---+---------+----------+---------+---------------+---------------+
| | Column1 | Column2 | Column3 | Column1_Upper | Column2_Upper |
+---+---------+----------+---------+---------------+---------------+
| 0 | 'Cat' | 'Bat' | 'xyz' | 'CAT' | 'BAT' |
| 1 | 'toy' | 'flower' | 'abc' | 'TOY' | 'FLOWER' |
| 2 | 'cat' | 'bat' | 'lmn' | 'CAT' | 'BAT' |
+---+---------+----------+---------+---------------+---------------+
Finally, run the below to drop the duplicates and created columns.
result_df = df.drop_duplicates(subset=['Column1_Upper', 'Column2_Upper'], keep='first')
result_df.drop(['Column1_Upper', 'Column2_Upper'], axis=1, inplace=True)
print(result_df)
This gives:
+-----------------------------+
| Column1 Column2 Column3 |
+-----------------------------+
| 0 'Cat' 'Bat' 'xyz' |
| 1 'toy' 'flower' 'abc' |
+-----------------------------+
You could convert the dataframe to lower case and then apply your solution.
Your dataframe.
df = pd.DataFrame({'Column1': ["'Cat'", "'toy'", "'cat'"],
'Column2': ["'Bat'", "'flower'", "'bat'"],
'Column3': ["'xyz'", "'abc'", "'lmn'"]})
print(df)
Column1 Column2 Column3
0 'Cat' 'Bat' 'xyz'
1 'toy' 'flower' 'abc'
2 'cat' 'bat' 'lmn'
Then apply lower string.
result_df = df.apply(lambda x: x.astype(str).str.lower()).drop_duplicates(subset=['Column1', 'Column2'], keep='first')
print(result_df)
Column1 Column2 Column3
0 'cat' 'bat' 'xyz'
1 'toy' 'flower' 'abc'
Then filter df for upper case.
df.loc[result_df.index]
Column1 Column2 Column3
0 'Cat' 'Bat' 'xyz'
1 'toy' 'flower' 'abc'
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With