I'm trying to run LDA (Latent Dirichlet Allocation) on a non-English text dataset.
From sklearn's tutorial, there's this part where you count term frequency of the words to feed into the LDA:
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=n_features,
                                stop_words='english')
This uses the built-in stop word list, which I think is only available for English. How can I use my own stop word list instead?
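You can pass your own collection of words to the stop_words argument instead of the string 'english'. The scikit-learn documentation lists a list as the accepted type, and recent releases may reject other containers such as a frozenset, so a plain list is the safer choice, e.g.:
stop_words = ["word1", "word2", "word3"]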
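To show how this fits into the vectorizer from the question, here is a minimal sketch. The Spanish toy corpus and the stop word list are made up for illustration, and min_df is lowered to 1 only because the corpus is tiny; adjust both to your own data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus and stop word list (not from the question).
docs = [
    "el gato come pescado",
    "el perro come carne",
    "la gata duerme con los perros",
]
my_stop_words = ["el", "la", "los"]

# Same call as in the tutorial, but with a custom list instead of 'english'.
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=1,
                                stop_words=my_stop_words)
tf = tf_vectorizer.fit_transform(docs)

# vocabulary_ maps each kept term to its column index in tf.
vocabulary = sorted(tf_vectorizer.vocabulary_)
print(vocabulary)
```

The resulting tf matrix can then be fed to LatentDirichletAllocation in the same way as in the tutorial. Note that stop words are matched after lowercasing and tokenization, so the entries in your list should be lowercase if you keep the default lowercase=True.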