I want to achieve something like in this post: Python Dataframe: Remove duplicate words in the same cell within a column in Python, but for the entire dataframe in an efficient way.
My data is a pandas DataFrame with many columns. The cells contain comma-separated strings with many duplicates, and I wish to remove all duplicates within each individual string.
+--------------------+---------+---------------------+
| Col1 | Col2 | Col3 |
+--------------------+---------+---------------------+
| Dog, Dog, Dog | India | Facebook, Instagram |
| Dog, Squirrel, Cat | Norway | Facebook, Facebook |
| Cat, Cat, Cat | Germany | Twitter |
+--------------------+---------+---------------------+
Reproducible example:
import pandas as pd

df = pd.DataFrame({"col1": ["Dog, Dog, Dog", "Dog, Squirrel, Cat", "Cat, Cat, Cat"],
"col2": ["India", "Norway", "Germany"],
"col3": ["Facebook, Instagram", "Facebook, Facebook", "Twitter"]})
I would like to transform it to this:
+--------------------+---------+---------------------+
| Col1 | Col2 | Col3 |
+--------------------+---------+---------------------+
| Dog | India | Facebook, Instagram |
| Dog, Squirrel, Cat | Norway | Facebook |
| Cat | Germany | Twitter |
+--------------------+---------+---------------------+
Try:
for col in ["col1", "col2", "col3"]:
    df[col] = df[col].str.split(", ").map(set).str.join(", ")
Outputs:
>>> df
col1 col2 col3
0 Dog India Facebook, Instagram
1 Dog, Cat, Squirrel Norway Facebook
2 Cat Germany Twitter
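Note that converting to a `set` does not preserve the original word order, which is why row 1 comes back as "Dog, Cat, Squirrel" rather than "Dog, Squirrel, Cat". If order matters, a sketch using `dict.fromkeys` (which keeps the first occurrence of each key) and looping over all columns rather than a hardcoded list:

```python
import pandas as pd

df = pd.DataFrame({"col1": ["Dog, Dog, Dog", "Dog, Squirrel, Cat", "Cat, Cat, Cat"],
                   "col2": ["India", "Norway", "Germany"],
                   "col3": ["Facebook, Instagram", "Facebook, Facebook", "Twitter"]})

def dedupe(cell):
    # dict.fromkeys drops duplicates while keeping first-occurrence order
    return ", ".join(dict.fromkeys(cell.split(", ")))

# Apply to every column of the DataFrame, not just a hardcoded subset
for col in df.columns:
    df[col] = df[col].map(dedupe)

print(df)
```

This keeps "Dog, Squirrel, Cat" intact in row 1 while still collapsing "Dog, Dog, Dog" to "Dog" and "Facebook, Facebook" to "Facebook".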