When using a topic model, how should we set up a "stop words" list?

Question

There are some standard stop lists, giving words like "a", "the", "of", and "not" to be removed from a corpus. However, I'm wondering: should the stop list change from case to case?

For example, I have 10K articles from a journal. Because of the way an article is structured, you will see words like "introduction", "review", "conclusion", and "page" in basically every article. My concern is: should we remove these words (the words that every document contains) from our corpus? Thanks for any comments and suggestions.

pnv · Accepted Answer

I am working on a similar problem, but in text categorization. In my experience, it is good to have a domain-specific stop word list in addition to the standard list. Otherwise, words like "introduction" and "review" will show up in the term frequency matrix, as you will see if you try analysing it. They can mislead your models by giving more weight to these domain-specific keywords.
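One way to build such a domain-specific list is to count how many documents each word appears in and treat words above some document-frequency cutoff as stop words. The following is only a minimal sketch of that idea, not a definitive recipe: the tiny `STANDARD_STOP_WORDS` set, the `max_doc_ratio=0.9` cutoff, the helper names, and the three sample documents are all assumptions made up for illustration.

```python
import re
from collections import Counter

# Tiny stand-in for a standard stop list (assumption; real lists are much longer).
STANDARD_STOP_WORDS = {"a", "the", "of", "not", "and", "in", "to", "is"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def domain_stop_words(documents, max_doc_ratio=0.9):
    """Return words that occur in more than max_doc_ratio of the documents."""
    doc_freq = Counter()
    for doc in documents:
        # Use a set so each word is counted at most once per document.
        doc_freq.update(set(tokenize(doc)))
    threshold = max_doc_ratio * len(documents)
    return {word for word, count in doc_freq.items() if count > threshold}


def remove_stop_words(documents, stop_words):
    """Tokenize each document and drop every token found in stop_words."""
    return [[t for t in tokenize(doc) if t not in stop_words] for doc in documents]


# Hypothetical sample corpus standing in for the journal articles.
documents = [
    "Introduction: a review of protein folding. Conclusion: folding is hard.",
    "Introduction: the review of galaxy formation. Conclusion: galaxies merge.",
    "Introduction: a review of market volatility. Conclusion: markets swing.",
]

# Combine the standard list with the corpus-derived, domain-specific list.
stop_words = STANDARD_STOP_WORDS | domain_stop_words(documents)
cleaned = remove_stop_words(documents, stop_words)

for tokens in cleaned:
    print(tokens)
```

The cutoff is a judgment call. A lower ratio removes more structural words but risks dropping terms that are genuinely topical for part of the corpus, so it is worth inspecting the resulting list by hand before feeding the cleaned tokens to a topic model.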

Tags:

text-classification

stop-words

lda

topic-modeling

Ruby
