How to set custom stop words for sklearn CountVectorizer?

Question

I'm trying to run LDA (Latent Dirichlet Allocation) on a non-English text dataset.

In sklearn's tutorial, there's a part where you count the term frequencies of the words to feed into LDA:

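from sklearn.feature_extraction.text import CountVectorizer

# n_features is defined earlier in the tutorial
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=n_features,
                                stop_words='english')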

This uses the built-in stop word list, which I think is only available for English. How can I use my own stop word list instead?

Wiktor Stribiżew · Accepted Answer

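You can pass your own collection of words to the stop_words argument. A frozenset has worked in older scikit-learn releases, but the documentation specifies a list, and newer releases that validate parameter types may reject other containers, so a list is the safer choice, e.g.:

stop_words = ["word1", "word2", "word3"]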
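To show the idea end to end, here is a minimal sketch that passes a custom list to CountVectorizer and checks which terms survive. The German stop words and the two sample documents are made up for illustration and are not from the question. max_df and min_df are left out only because the toy corpus is too small for them to be meaningful.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stop words and documents, for illustration only.
custom_stop_words = ["der", "die", "das", "und", "ist"]
docs = [
    "der Hund ist gross und schnell",
    "die Katze ist klein und schnell",
]

# Pass the custom list through the stop_words argument.
# CountVectorizer lowercases text by default, so the stop words
# should be given in lowercase as well.
tf_vectorizer = CountVectorizer(stop_words=custom_stop_words)
tf = tf_vectorizer.fit_transform(docs)

# The learned vocabulary no longer contains the stop words.
vocabulary = sorted(tf_vectorizer.vocabulary_)
print(vocabulary)
```

The resulting tf matrix can then be fed to LDA in the same way as in the tutorial. If your stop words live in a file, reading them into a plain Python list first should work the same way.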

Tags:

python

machine-learning

nlp

scikit-learn

troll
