There are standard stop lists containing words like "a", "the", "of" and "not" that are removed from a corpus. However, I'm wondering: should the stop list change case by case?
For example, I have 10K articles from a journal. Because of the way an article is structured, words like "introduction", "review", "conclusion" and "page" show up in basically every article. My concern is: should we remove these words (the ones that every document contains) from our corpus? Thanks for any comments and suggestions.
I am working on a similar problem, but for text categorization. From my experience, it is good to have a domain-specific stop word list alongside the standard list. Otherwise, words like "introduction" and "review" will show up in the term frequency matrix, as you will have seen if you have tried analysing it. Because they occur in nearly every document, they say little about what distinguishes one article from another, and they can mislead your models by giving too much weight to these domain-specific keywords.
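One way to build such a domain-specific list is to count, for each word, how many documents it appears in, and treat words above some document-frequency cutoff as stop words. Here is a minimal sketch of that idea in plain Python. The function names, the toy documents, the simple tokenizer, and the 0.9 cutoff are all assumptions made for illustration, not something from the question; you would tune the cutoff on your own corpus.

```python
from collections import Counter

# Standard stop words: just the examples from the question, not a full list.
STANDARD_STOP_WORDS = {"a", "the", "of", "not"}


def tokenize(text):
    """Very rough tokenizer: split on whitespace, trim punctuation, lowercase."""
    tokens = (w.strip('.,;:!?"()').lower() for w in text.split())
    return [t for t in tokens if t]


def domain_stop_words(docs, min_doc_fraction=0.9):
    """Return words that occur in at least min_doc_fraction of the documents."""
    doc_freq = Counter()
    for doc in docs:
        # Use a set so each word is counted once per document.
        doc_freq.update(set(tokenize(doc)))
    cutoff = min_doc_fraction * len(docs)
    return {word for word, n in doc_freq.items() if n >= cutoff}


def remove_stop_words(doc, stop_words):
    """Tokenize a document and drop every token found in stop_words."""
    return [w for w in tokenize(doc) if w not in stop_words]


# Toy corpus (hypothetical) imitating articles that share structural words.
docs = [
    "Introduction: a review of the neural parsing. Conclusion on page 3.",
    "Introduction: the review of protein folding. Conclusion on page 9.",
    "Introduction: a review of the market volatility. Conclusion on page 5.",
]

stop_words = STANDARD_STOP_WORDS | domain_stop_words(docs)
cleaned = [remove_stop_words(d, stop_words) for d in docs]
print(cleaned[0])  # ['neural', 'parsing', '3']
```

If you use scikit-learn, the `max_df` parameter of `CountVectorizer` and `TfidfVectorizer` applies the same kind of document-frequency cutoff, and TF-IDF weighting already down-weights words that occur in most documents.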