removing stop words using spacy

I am cleaning a column in my data frame, Sumcription, and am trying to do 3 things:

Tokenize
Lemmatize
Remove stop words

import spacy        
nlp = spacy.load('en_core_web_sm', parser=False, entity=False)        
df['Tokens'] = df.Sumcription.apply(lambda x: nlp(x))    
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS        
spacy_stopwords.add('attach')
df['Lema_Token']  = df.Tokens.apply(lambda x: " ".join([token.lemma_ for token in x if token not in spacy_stopwords]))

However, when I print for example:

df.Lema_Token.iloc[8]

The output still has the word attach in it: attach poster on the wall because it is cool

Why does it not remove the stop word?

I also tried this:

df['Lema_Token_Test']  = df.Tokens.apply(lambda x: [token.lemma_ for token in x if token not in spacy_stopwords])

But the string attach still appears.

asked Apr 23 '19 18:04

Nelly Yuki

1 Answer

import spacy
import pandas as pd

# Load the spaCy model; the parser and NER are not needed for lemmas or stop words
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

# New stop words list 
customize_stop_words = [
    'attach'
]

# Mark them as stop words
for w in customize_stop_words:
    nlp.vocab[w].is_stop = True


# Test data
df = pd.DataFrame( {'Sumcription': ["attach poster on the wall because it is cool",
                                   "eating and sleeping"]})

# Convert each row into a spaCy document and keep the lemma of every token
# that is not a stop word. Finally, join the lemmas into a single string.
df['Sumcription_lema'] = df.Sumcription.apply(lambda text:
                                          " ".join(token.lemma_ for token in nlp(text)
                                                   if not token.is_stop))

print(df)

Output:

                                    Sumcription  Sumcription_lema
0  attach poster on the wall because it is cool  poster wall cool
1                           eating and sleeping         eat sleep
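
For context on why the filter in the question keeps every word: `token not in spacy_stopwords` tests a spaCy Token object against a set of plain strings. The object itself is never found in that set, so the condition is always true and nothing is dropped. Below is a minimal sketch of the string-to-string comparison, assuming only that each token exposes a `lemma_` attribute. `StandInToken` and `lemmas_without_stop_words` are hypothetical names, used so the example runs without a spaCy model installed.

```python
class StandInToken:
    """Hypothetical stand-in for a spaCy Token (only text and lemma_)."""

    def __init__(self, text, lemma):
        self.text = text
        self.lemma_ = lemma


def lemmas_without_stop_words(tokens, stop_words):
    """Join the lemmas of all tokens whose lemma is not a stop word.

    `tokens` can be any iterable of objects with a `lemma_` string
    attribute. The membership test compares a string with strings.
    """
    return " ".join(
        token.lemma_ for token in tokens
        if token.lemma_.lower() not in stop_words
    )


words = ["attach", "poster", "on", "the", "wall"]
tokens = [StandInToken(w, w) for w in words]
stop_words = {"attach", "on", "the"}

# Comparing the token objects themselves: no object equals a string,
# so every token passes the filter.
unfiltered = " ".join(t.lemma_ for t in tokens if t not in stop_words)

# Comparing the lemma strings: the stop words are dropped.
filtered = lemmas_without_stop_words(tokens, stop_words)

print(unfiltered)
print(filtered)
```

With the data frame from the question, the same function could presumably be applied as `df.Tokens.apply(lambda doc: lemmas_without_stop_words(doc, spacy_stopwords))`, since iterating over a spaCy document yields tokens that have `lemma_`.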

answered Sep 19 '22 15:09

mujjiga

Tags: python, nlp, python-3.7, spacy, data-cleaning
