Pandas dataframe - how to eliminate duplicate words in a column

I have a pandas dataframe:

import pandas as pd

df = pd.DataFrame({'category':[0,1,2],
                   'text': ['this is some text for the first row',
                            'second row has this text',
                            'third row this is the text']})
df.head()

I would like to remove the words that appear in every row.

Expected result (for the example above):

category     text
0            is some for the first
1            second has
2            third is the

With the following code I tried to join the text of all rows into a single string:

final_list =[]
for index, rows in df.iterrows():
    # Create list for the current row
    my_list =rows.text
    # append the list to the final list
    final_list.append(my_list)
# Print the list
print(final_list)
text=''

for i in range(len(final_list)):
    text+=final_list[i]+', '

print(text)

I then tried the idea from this question (pandas dataframe - how to find words that repeat in each row), but it does not give me the expected result:

arr = [set(x.split()) for x in text.split(',')]
mutual_words = set.intersection(*arr)
result = [list(x.difference(mutual_words)) for x in arr]
result = sum(result, [])
final_text = (", ").join(result)
print(final_text)

Does anyone have an idea how to get it?

asked Oct 15 '22 00:10 by mark


1 Answer

You can use Series.str.split to split the text column on whitespace, then use reduce to compute the intersection of the words found in every row, and finally use str.replace to remove those common words:

from functools import reduce

w = reduce(lambda x, y: set(x) & set(y), df['text'].str.split())
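# regex=True is required on pandas >= 2.0, where str.replace matches literally by default;
# \b keeps the pattern from matching inside longer words
df['text'] = df['text'].str.replace(rf"(\s*)\b(?:{'|'.join(w)})\b\s*", r'\1', regex=True).str.strip()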

   category                    text
0         0   is some for the first
1         1              second has
2         2            third is the 
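
The same result can probably also be reached without a regular expression, by filtering the split tokens directly. The following is only a sketch under a few assumptions: words are whitespace-separated, the column has at least one row and no missing values, and the names `tokens` and `common` are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'category': [0, 1, 2],
                   'text': ['this is some text for the first row',
                            'second row has this text',
                            'third row this is the text']})

# Split each row into a list of whitespace-separated tokens.
tokens = df['text'].str.split()

# Words present in every row: intersect the per-row sets.
# (Assumes at least one row and no NaN values in the column.)
common = set.intersection(*map(set, tokens))

# Keep only the tokens that are not common, preserving their order.
df['text'] = tokens.apply(lambda ws: ' '.join(w for w in ws if w not in common))

print(df)
```

Because the comparison is done on whole tokens, this variant needs no escaping of special characters and cannot remove part of a longer word.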
answered Oct 19 '22 03:10 by Shubham Sharma