
Remove duplicates from python dataframe list

I have a pandas DataFrame in which each row holds a list of words. The lists contain duplicate words, and I want to remove the duplicates.

I tried using dict.fromkeys(listname) in a for loop that iterates over each row of the DataFrame, but this splits the words into individual characters.

import pandas as pd

filepath = "C:/abc5/Python/Clustering/output2.csv"
df = pd.read_csv(filepath,encoding='windows-1252')

df["newlist"] = df["text_lemmatized"]
for i in range(0,len(df)):
    l = df["text_lemmatized"][i]
    df["newlist"][i] = list(dict.fromkeys(l))

print(df)

Expected result (original list, then the deduplicated list):

['clear', 'pending', 'order', 'pending', 'order']   ['clear', 'pending', 'order']
 ['pending', 'activation', 'clear', 'pending']   ['pending', 'activation', 'clear']

Actual result:

['clear', 'pending', 'order', 'pending', 'order']  ...   [[, ', c, l, e, a, r, ,,  , p, n, d, i, g, o, ]]
['pending', 'activation', 'clear', 'pending', ...  ...  [[, ', p, e, n, d, i, g, ,,  , a, c, t, v, o, ...
Anoop Mahajan asked Apr 25 '26 01:04


1 Answer

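The column is read from a CSV file, so each cell appears to be the string form of a list rather than an actual list (the stray [, ' and , characters in your output point to this). Iterating over a string yields characters, which is why dict.fromkeys returns single letters; set would do the same.

Parse each cell back into a list first, then remove the duplicates. dict.fromkeys keeps the original word order, which matches your expected result; set does not guarantee any order.

Also, you don't need the for loop (this line needs import ast):

  df["newlist"] = df["text_lemmatized"].apply(lambda s: list(dict.fromkeys(ast.literal_eval(s))))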
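
A minimal, self-contained sketch of per-row deduplication follows. It assumes each cell of text_lemmatized is the string form of a Python list, as it would be after a round trip through CSV. The in-memory CSV and the helper name dedupe_keep_order are made up for illustration; only the column names come from the question.

```python
import ast
import io

import pandas as pd

# Hypothetical stand-in for output2.csv: each cell is the *string* form of a list.
csv_text = (
    "text_lemmatized\n"
    "\"['clear', 'pending', 'order', 'pending', 'order']\"\n"
    "\"['pending', 'activation', 'clear', 'pending']\"\n"
)
df = pd.read_csv(io.StringIO(csv_text))


def dedupe_keep_order(cell):
    """Return the words of one cell with repeats dropped, first occurrences kept."""
    # Cells read from CSV are strings; parse them back into real lists.
    words = ast.literal_eval(cell) if isinstance(cell, str) else cell
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(words))


df["newlist"] = df["text_lemmatized"].apply(dedupe_keep_order)
print(df["newlist"].tolist())
```

If the column already holds real lists (for example, when the DataFrame was built in memory and never written to CSV), the isinstance check skips the parsing step and only the deduplication runs.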
Anthony Kong answered Apr 26 '26 17:04


