I want to remove stopwords from the Data column in my file.
I have already filtered the rows down to the ones where the end-user is speaking.
But usertext.apply(lambda x: [word for word in x if word not in stop_words]) does not remove the stopwords.
What am I doing wrong?
import pandas as pd
from stop_words import get_stop_words
df = pd.read_csv("F:/textclustering/data/cleandata.csv", encoding="iso-8859-1")
usertext = df[df.Role.str.contains("End-user",na=False)][['Data','chatid']]
stop_words = get_stop_words('dutch')
clean = usertext.apply(lambda x: [word for word in x if word not in stop_words])
print(clean)
You can build a regex pattern of your stop words and call the vectorised str.replace
to remove them:
In [124]:
stop_words = ['a','not','the']
stop_words_pat = '|'.join(['\\b' + stop + '\\b' for stop in stop_words])
stop_words_pat
Out[124]:
'\\ba\\b|\\bnot\\b|\\bthe\\b'
In [125]:
df = pd.DataFrame({'text':['a to the b', 'the knot ace a']})
df['text'].str.replace(stop_words_pat, '', regex=True)
Out[125]:
0 to b
1 knot ace
Name: text, dtype: object
Here a list comprehension wraps each stop word in '\b', which matches a word boundary, and the wrapped words are then joined with '|' so that the pattern matches any one of them.
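As a rough sketch of the same idea using only the standard re module: the pattern below is built the same way, with two additions that are not in the answer above. re.escape guards against stop words containing regex metacharacters, and the helper strip_stop_words (a name made up for this illustration) collapses the extra spaces left behind by removed words.

```python
import re

stop_words = ["a", "not", "the"]  # same toy list as in the answer above

# Wrap the alternation in \b ... \b so only whole words match;
# re.escape protects against stop words that contain regex metacharacters.
pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(word) for word in stop_words) + r")\b"
)

def strip_stop_words(text):
    """Remove stop words, then collapse the whitespace they leave behind."""
    without_stops = pattern.sub("", text)
    return " ".join(without_stops.split())

cleaned = [strip_stop_words(t) for t in ["a to the b", "the knot ace a"]]
print(cleaned)
```

Note that "knot" survives even though it contains "not", because the word boundaries only match whole words. The same pattern string (pattern.pattern) could presumably be handed to the vectorised str.replace with regex=True.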
Two issues:
First, you have a module called stop_words and you later create a variable named stop_words. This is bad form.
Second, you are passing a lambda function to .apply that wants its x parameter to be a list, rather than a value within a list. That is, instead of doing df.apply(sqrt) you are doing df.apply(lambda x: [sqrt(val) for val in x]).
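To make the second point concrete, here is a small sketch of what each flavour of apply actually hands to the function. The frame below is a made-up stand-in for the asker's usertext, and the recorder functions exist only for this illustration.

```python
import pandas as pd

# Hypothetical stand-in for the filtered frame from the question.
usertext = pd.DataFrame(
    {"Data": ["ik heb een vraag", "de bestelling"], "chatid": [1, 2]}
)

received_by_dataframe_apply = []
received_by_series_apply = []

def record_column(col):
    # DataFrame.apply (default axis=0) passes one whole column at a time.
    received_by_dataframe_apply.append((col.name, type(col).__name__))
    return len(col)

def record_cell(cell):
    # Series.apply passes one cell value at a time.
    received_by_series_apply.append(cell)
    return cell

usertext.apply(record_column)
usertext["Data"].apply(record_cell)

print(received_by_dataframe_apply)
print(received_by_series_apply)
```

So in the question's code, `for word in x` walks over the cells of a column, and each cell is a whole sentence rather than a single word, which is why no stop word is ever matched.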
You should either do the list-processing yourself:
clean = [x for x in usertext if x not in stop_words]
Or you should do the apply, with a function that takes one word at a time:
clean = usertext.apply(lambda x: x if x not in stop_words else '')
As @Jean-François Fabre suggested in a comment, you can speed things up if your stop_words is a set rather than a list:
from stop_words import get_stop_words
nl_stop_words = set(get_stop_words('dutch')) # NOTE: set
usertext = ...
clean = usertext.apply(lambda word: word if word not in nl_stop_words else '')
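The snippets above treat each value as a single word. If each cell of the Data column holds a whole sentence, as the question suggests, a per-cell helper along these lines may be closer to what is needed. This is only a sketch: remove_stop_words is a hypothetical helper, the tiny stop-word set stands in for set(get_stop_words('dutch')), and splitting on whitespace with lower-casing is an assumed tokenisation.

```python
# Illustrative stand-in for set(get_stop_words('dutch')); not the real list.
nl_stop_words = {"de", "het", "een", "ik"}

def remove_stop_words(text, stop_words=nl_stop_words):
    """Drop stop words from one cell's text, keeping the remaining words in order."""
    if not isinstance(text, str):
        # Leave missing values (e.g. NaN) and other non-text cells untouched.
        return text
    kept = [word for word in text.split() if word.lower() not in stop_words]
    return " ".join(kept)

result = remove_stop_words("Ik heb een vraag over de bestelling")
print(result)
```

With pandas, this would be applied to the text column only, for example usertext['Data'].apply(remove_stop_words), so that the function receives one cell at a time.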