I am cleaning a column in my data frame, Sumcription, and am trying to do 3 things:
Remove stop words
import spacy
nlp = spacy.load('en_core_web_sm', parser=False, entity=False)
df['Tokens'] = df.Sumcription.apply(lambda x: nlp(x))
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
spacy_stopwords.add('attach')
df['Lema_Token'] = df.Tokens.apply(lambda x: " ".join([token.lemma_ for token in x if token not in spacy_stopwords]))
However, when I print for example:
df.Lema_Token.iloc[8]
The output still has the word attach in it:
attach poster on the wall because it is cool
Why does it not remove the stop word?
I also tried this:
df['Lema_Token_Test'] = df.Tokens.apply(lambda x: [token.lemma_ for token in x if token not in spacy_stopwords])
But the string attach still appears.
iii) Remove stop words using spaCy. We create an empty list to store the words that are not stop words. Using a for loop that iterates over the text (split on whitespace), we check whether each word is present in the stop-word list; if it is not, we append it to the list.
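The loop described above can be sketched in plain Python. The stop-word set here is a small hypothetical stand-in rather than spaCy's STOP_WORDS, and remove_stopwords is an illustrative name, not a library function.

```python
# Hypothetical stand-in for a real stop-word list such as spaCy's STOP_WORDS.
STOP_WORDS = {"on", "the", "because", "it", "is", "attach"}


def remove_stopwords(text, stop_words=STOP_WORDS):
    """Return the whitespace-separated words of `text` that are not stop words."""
    kept = []  # empty list that collects the non-stop-words
    for word in text.split():
        # Compare case-insensitively so "The" and "the" are treated alike.
        if word.lower() not in stop_words:
            kept.append(word)
    return kept


print(remove_stopwords("attach poster on the wall because it is cool"))
# ['poster', 'wall', 'cool']
```

Splitting on whitespace leaves punctuation attached to words, so a real pipeline would normally rely on a tokenizer instead.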
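lemma_ is applied to the token only after the token has been checked against the stop-word list. So if a stop word appears in the text in a form other than its lemma, the check does not match and the word is not treated as a stop word. Note also that the check in the question compares the Token object itself, not its text, with the strings in the stop-word set, so it never matches; comparing token.text or token.lemma_, or using token.is_stop, avoids this.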
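To make the ordering issue concrete, here is a minimal sketch that does not need spaCy. TOY_LEMMAS is a hypothetical lookup standing in for token.lemma_, and both filter functions are illustrative names; the sketch only shows how a check on the surface form can miss a word whose lemma is in the stop list.

```python
# Toy lemma lookup standing in for spaCy's token.lemma_ (an assumption for illustration).
TOY_LEMMAS = {"attached": "attach", "posters": "poster"}
STOP_WORDS = {"attach", "the"}


def lemma(word):
    """Return the toy lemma of `word`, or the word itself if it has none."""
    return TOY_LEMMAS.get(word, word)


def filter_on_surface(words):
    # Checks the raw word against the stop list, then lemmatizes what is left.
    return [lemma(w) for w in words if w not in STOP_WORDS]


def filter_on_lemma(words):
    # Lemmatizes first, then checks the lemma against the stop list.
    return [lemma(w) for w in words if lemma(w) not in STOP_WORDS]


words = "attached the posters".split()
print(filter_on_surface(words))  # ['attach', 'poster']
print(filter_on_lemma(words))    # ['poster']
```

In the first variant "attached" passes the check because only its lemma, "attach", is in the stop list.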
Stop word removal is one of the most commonly used preprocessing steps across NLP applications. The idea is simply to remove the words that occur commonly across all the documents in the corpus. Articles and pronouns are typically classified as stop words.
import spacy
import pandas as pd

# Load the spaCy model; the parser and NER are not needed for lemmatization
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

# New stop words list
customize_stop_words = [
    'attach'
]

# Mark them as stop words
for w in customize_stop_words:
    nlp.vocab[w].is_stop = True

# Test data
df = pd.DataFrame({'Sumcription': ["attach poster on the wall because it is cool",
                                   "eating and sleeping"]})

# Convert each row into a spaCy document and return the lemma of each token
# in the document if it is not a stop word. Finally, join the lemmas into a string
df['Sumcription_lema'] = df.Sumcription.apply(
    lambda text: " ".join(token.lemma_ for token in nlp(text)
                          if not token.is_stop))

print(df)
Output:
                                    Sumcription  Sumcription_lema
0  attach poster on the wall because it is cool  poster wall cool
1                           eating and sleeping         eat sleep