I have a pandas dataframe:
import pandas as pd
df = pd.DataFrame({'category': [0, 1, 2],
                   'text': ['this is some text for the first row',
                            'second row has this text',
                            'third row this is the text']})
df.head()
I would like to remove the words that appear in every row, so that each row keeps only its other words.
Expected result (for the example above):
category  text
0         is some for the first
1         second has
2         third is the
With the following code I tried to join the text of all rows into a single string:
final_list = []
for index, rows in df.iterrows():
    # Create list for the current row
    my_list = rows.text
    # append the list to the final list
    final_list.append(my_list)
# Print the list
print(final_list)
text = ''
for i in range(len(final_list)):
    text += final_list[i] + ', '
print(text)
The idea from this question (pandas dataframe- how to find words that repeat in each row) does not give me the expected result:
arr = [set(x.split()) for x in text.split(',')]
mutual_words = set.intersection(*arr)
result = [list(x.difference(mutual_words)) for x in arr]
result = sum(result, [])
final_text = (", ").join(result)
print(final_text)
Does anyone have an idea how to get it?
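The attempt above most likely returns every word because the joined string ends with ', ', so text.split(',') yields one extra chunk that holds only a space. Its word set is empty, and intersecting anything with an empty set gives an empty set. A small sketch reproducing this with the example rows (the variable names rows and chunks are illustrative, not from the question):

```python
rows = ['this is some text for the first row',
        'second row has this text',
        'third row this is the text']

# Same joining as in the question: every row is followed by ', '
text = ''
for row in rows:
    text += row + ', '

chunks = text.split(',')
print(repr(chunks[-1]))           # the trailing chunk holds only a space

arr = [set(chunk.split()) for chunk in chunks]
print(set.intersection(*arr))     # empty, because the last set has no words

# Building the sets from the rows directly avoids the stray chunk
mutual_words = set.intersection(*(set(row.split()) for row in rows))
print(sorted(mutual_words))
```

Even with the intersection fixed, the rest of that attempt flattens all remaining words into one list, so the per-row structure still has to be handled separately.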
You can use Series.str.split to split the text column on whitespace, then use reduce to get the intersection of the words found in all rows, and finally use str.replace to remove the common words:
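from functools import reduce
# words shared by every row
w = reduce(lambda x, y: set(x) & set(y), df['text'].str.split())
# regex=True is required in pandas >= 2.0, where str.replace defaults to literal matching
df['text'] = df['text'].str.replace(rf"(\s*)(?:{'|'.join(w)})\s*", r'\1', regex=True).str.strip()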
   category                   text
0         0  is some for the first
1         1             second has
2         2           third is the
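As a regex-free variant of the same idea, the common words can also be dropped token by token. This is only a sketch: the helper name drop_common is made up for illustration, and it assumes words are separated by whitespace and compared case-sensitively. Unlike the pattern above, it cannot match inside longer words (for example "row" inside "throw") and needs no escaping of special characters.

```python
import pandas as pd

df = pd.DataFrame({'category': [0, 1, 2],
                   'text': ['this is some text for the first row',
                            'second row has this text',
                            'third row this is the text']})

# Words present in every row: intersect the per-row word sets
tokens = df['text'].str.split()
common = set.intersection(*map(set, tokens))

def drop_common(words):
    # hypothetical helper: keep the original word order, skip the shared words
    return ' '.join(word for word in words if word not in common)

df['text'] = tokens.apply(drop_common)
print(df)
```

Because the words are rejoined with single spaces, no extra strip step is needed afterwards.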