 

Removing stop words using spaCy

I am cleaning a column in my data frame, Sumcription, and am trying to do 3 things:

  1. Tokenize
  2. Lemmatize
  3. Remove stop words

    import spacy        
    nlp = spacy.load('en_core_web_sm', parser=False, entity=False)        
    df['Tokens'] = df.Sumcription.apply(lambda x: nlp(x))    
    spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS        
    spacy_stopwords.add('attach')
    df['Lema_Token']  = df.Tokens.apply(lambda x: " ".join([token.lemma_ for token in x if token not in spacy_stopwords]))
    

However, when I print for example:

df.Lema_Token.iloc[8]

The output still has the word attach in it: "attach poster on the wall because it is cool"

Why does it not remove the stop word?

I also tried this:

df['Lema_Token_Test']  = df.Tokens.apply(lambda x: [token.lemma_ for token in x if token not in spacy_stopwords])

But the string attach still appears.

asked Apr 23 '19 by Nelly Yuki


People also ask

How do I remove stop words from spaCy?

Create an empty list to store the words that are not stop words. Then, using a for loop that iterates over the text (split on whitespace), check whether each word is present in the stop-word list; if it is not, append it to the list.
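
A minimal sketch of that loop, using the question's example sentence and spaCy's built-in English stop-word list (which words survive depends on that list):

    from spacy.lang.en.stop_words import STOP_WORDS

    text = "attach poster on the wall because it is cool"

    # Keep only the words that are not in spaCy's default stop-word list
    filtered = []
    for word in text.split():
        if word.lower() not in STOP_WORDS:
            filtered.append(word)

    print(filtered)  # e.g. ['attach', 'poster', 'wall', 'cool']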

Does Lemmatization remove stop words?

lemma_ is applied to the token only after the token has been checked against the stop-word list. Therefore, if a word matches a stop word only in its lemmatized form, it will not be treated as a stop word.
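
A small sketch that illustrates this, assuming the en_core_web_sm model is installed ('attach' is used purely as an example word):

    import spacy

    nlp = spacy.load('en_core_web_sm')

    # Mark the lemma 'attach' as a stop word
    nlp.vocab['attach'].is_stop = True

    for token in nlp("She attached the poster"):
        print(token.text, token.lemma_, token.is_stop)
    # With correct POS tagging, 'attached' lemmatizes to 'attach',
    # yet its is_stop flag stays False, because the stop-word flag
    # is looked up for the surface form, not the lemma.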

What is stop word removal?

Stop word removal is one of the most commonly used preprocessing steps across different NLP applications. The idea is simply to remove the words that occur commonly across all the documents in the corpus. Typically, articles and pronouns are classified as stop words.
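
For reference, spaCy exposes its default English stop-word list, which can be inspected directly (a small sketch):

    from spacy.lang.en.stop_words import STOP_WORDS

    # The default English list contains common function words
    print(len(STOP_WORDS))
    print('the' in STOP_WORDS, 'because' in STOP_WORDS)   # True True
    print('poster' in STOP_WORDS)                         # False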


1 Answer

import spacy
import pandas as pd

# Load the spaCy English model; the parser and NER are not needed here
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

# New stop words list 
customize_stop_words = [
    'attach'
]

# Mark them as stop words
for w in customize_stop_words:
    nlp.vocab[w].is_stop = True


# Test data
df = pd.DataFrame( {'Sumcription': ["attach poster on the wall because it is cool",
                                   "eating and sleeping"]})

# Convert each row into a spaCy document and keep the lemma of each token
# that is not a stop word, then join the lemmas into a single string
df['Sumcription_lema'] = df.Sumcription.apply(lambda text: 
                                          " ".join(token.lemma_ for token in nlp(text) 
                                                   if not token.is_stop))

print (df)

Output:

   Sumcription                                   Sumcription_lema
0  attach poster on the wall because it is cool  poster wall cool
1                           eating and sleeping         eat sleep
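
A likely reason the original attempt kept attach: df.Tokens contains spaCy Token objects, not strings, so the check token not in spacy_stopwords compares a Token against a set of strings and never filters anything. Checking token.text against the set (or using token.is_stop, as above) works on the string form instead. A small sketch of the distinction, assuming en_core_web_sm is installed:

    import spacy
    from spacy.lang.en.stop_words import STOP_WORDS

    nlp = spacy.load('en_core_web_sm')
    doc = nlp("attach poster on the wall")

    token = doc[3]                      # the word "the"
    print(type(token))                  # spacy.tokens.token.Token -- not a str
    print(token.text in STOP_WORDS)     # True: membership needs the string form
    print(token.is_stop)                # True: the flag used in the answer above
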
answered by mujjiga