
Is it necessary to do stopword removal, stemming, or lemmatization for text classification when using spaCy or BERT?

Are stopword removal, stemming, and lemmatization necessary for text classification when using spaCy, BERT, or other advanced NLP models to get the vector embedding of the text?

text="The food served in the wedding was very delicious"

1. Since spaCy and BERT were trained on huge raw datasets, is there any benefit to applying stopword removal, stemming, and lemmatization to the text before generating the embedding with BERT/spaCy for a text classification task?

2. I understand that stopword removal, stemming, and lemmatization are useful when we use CountVectorizer or TfidfVectorizer to get vectors for sentences.

star asked Aug 28 '20 12:08


3 Answers

You can test whether stemming, lemmatization, and stopword removal help. They don't always. I usually apply them if I'm going to graph the results, as the stopwords clutter up the plots.
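To illustrate the clutter point, here is a minimal sketch that counts the most frequent tokens with and without stopwords. The stopword list, the `top_words` helper, and the second sample sentence are assumptions made up for this illustration; a real project would more likely use the stopword list shipped with spaCy or NLTK.

```python
from collections import Counter

# Assumption: a tiny hand-picked stopword list, for illustration only.
STOPWORDS = {"the", "in", "was", "very", "a", "an", "of", "and", "is", "it"}

def top_words(texts, n=3, remove_stopwords=False):
    """Return the n most frequent lowercase tokens across all texts."""
    counts = Counter()
    for text in texts:
        for token in text.lower().split():
            if remove_stopwords and token in STOPWORDS:
                continue
            counts[token] += 1
    return [word for word, _ in counts.most_common(n)]

docs = [
    "The food served in the wedding was very delicious",
    "The cake was delicious and the music was loud",  # hypothetical second document
]

with_sw = top_words(docs, n=3)                            # dominated by "the" and "was"
without_sw = top_words(docs, n=3, remove_stopwords=True)  # content words only

print(with_sw)
print(without_sw)
```

With stopwords kept, "the" and "was" take the top slots of any frequency chart; with them removed, "delicious" comes first. This only concerns plotting and inspection, not what you feed to the model.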

A case for not removing stopwords: stopwords provide context about the user's intent, which matters when you use a contextual model like BERT. With such models, all stopwords are kept so that enough context information is preserved, including negation words (not, nor, never), which are usually considered stopwords.

According to https://arxiv.org/pdf/1904.07531.pdf:

"Surprisingly, the stopwords received as much attention as non-stop words, but removing them has no effect in MRR performances."

john taylor answered Oct 17 '22 05:10


With BERT you don't preprocess the text; otherwise, you lose context (stemming, lemmatization) or change the text outright (stop word removal).

Some more basic models (rule-based or bag-of-words) would benefit from some processing, but you must be very careful with stop words removal: many words that change the meaning of an entire sentence are stop words (not, no, never, unless).
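A small sketch of that pitfall: with a stopword list that contains negations, two sentences with opposite meanings collapse to the same tokens. The stopword set, the `NEGATIONS` set, and the `remove_stopwords` helper are assumptions for illustration, not taken from any particular library.

```python
# Assumption: a small illustrative stopword list that, like many common
# lists, includes negation words.
STOPWORDS = {"the", "was", "a", "is", "not", "no", "never", "unless"}

# Words from this answer that flip the meaning of a sentence.
NEGATIONS = {"not", "no", "never", "unless"}

def remove_stopwords(text, stopwords):
    """Lowercase, split on whitespace, and drop tokens found in stopwords."""
    return [t for t in text.lower().split() if t not in stopwords]

positive = remove_stopwords("The food was good", STOPWORDS)
negative = remove_stopwords("The food was not good", STOPWORDS)
print(positive == negative)  # the two sentences are now indistinguishable

# One possible mitigation: take the negations out of the stopword list.
safer = STOPWORDS - NEGATIONS
negative_kept = remove_stopwords("The food was not good", safer)
print(negative_kept)
```

If you do remove stopwords for a bag-of-words model, subtracting negation words from the list first is one way to keep the two sentences distinct.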

Jiulin Teng answered Oct 17 '22 05:10


  • Do not remove stopwords (SW), as they add information (context awareness) to the sentence (viz., text summarization, machine/language translation, language modeling, question answering).

  • Remove SW if we only want the general idea of the sentence (viz., sentiment analysis, language/text classification, spam filtering, caption generation, auto-tag generation, topic/document

rohan goli answered Oct 17 '22 06:10