Pandas dataframe - how to eliminate duplicate words in a column

Question

I have a pandas dataframe:

import pandas as pd

df = pd.DataFrame({'category':[0,1,2],
                   'text': ['this is some text for the first row',
                            'second row has this text',
                            'third row this is the text']})
df.head()

I would like to remove the words that appear in every row. The expected result for the example above is:

category     text
0            is some for the first
1            second has
2            third is the

I tried the following code to join the text of all rows into a single string:

final_list = []
for index, row in df.iterrows():
    # Append the text of the current row to the list
    final_list.append(row.text)
print(final_list)

# Join the texts of all rows into one comma-separated string
text = ''
for i in range(len(final_list)):
    text += final_list[i] + ', '

print(text)

The approach from a related question (pandas dataframe - how to find words that repeat in each row) does not give me the expected result:

arr = [set(x.split()) for x in text.split(',')]
mutual_words = set.intersection(*arr)
result = [list(x.difference(mutual_words)) for x in arr]
result = sum(result, [])
final_text = (", ").join(result)
print(final_text)

Does anyone know how to get the expected result?

Shubham Sharma · Accepted Answer

You can use Series.str.split to split the text column on whitespace, then use reduce to get the intersection of the words found in all rows, and finally use Series.str.replace to remove the common words:

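from functools import reduce

# Words that appear in every row
w = reduce(lambda x, y: set(x) & set(y), df['text'].str.split())
# Remove the common words; regex=True is required since pandas 2.0,
# where str.replace treats the pattern as a literal string by default
df['text'] = df['text'].str.replace(rf"(\s*)(?:{'|'.join(w)})\s*", r'\1', regex=True).str.strip()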

   category                    text
0         0   is some for the first
1         1              second has
2         2            third is the
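
As an alternative to the regex replacement, the same idea can be sketched by comparing whole tokens: split each row once, intersect the word sets, and rebuild each row from the words that are left. This is only a minimal sketch, not part of the original answer; the names tokens and common are illustrative, and it assumes the dataframe has at least one row and that words are separated by whitespace.

```python
import pandas as pd

df = pd.DataFrame({'category': [0, 1, 2],
                   'text': ['this is some text for the first row',
                            'second row has this text',
                            'third row this is the text']})

# Split each row into a list of words (whitespace-separated)
tokens = df['text'].str.split()

# Words present in every row; assumes at least one row exists
common = set.intersection(*map(set, tokens))

# Rebuild each row from the words that are not common, keeping their order
df['text'] = tokens.apply(
    lambda words: ' '.join(word for word in words if word not in common)
)

print(sorted(common))  # ['row', 'text', 'this']
print(df)
```

Because whole tokens are compared, a common word such as row would not be stripped out of a longer word such as rows, which a pattern without word boundaries could do. Working on per-row lists also avoids the problem in the attempt from the question: the joined string ends with ', ', so splitting it on ',' yields a final empty set and the intersection comes out empty.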

Tags: python, pandas, dataframe
