I am using CountVectorizer from sklearn. I want to provide a list of stop words and apply the vectorizer with ngram_range=(1, 3).
From what I can tell, if a word (say "me") is in the list of stop words, it is also dropped from the higher-order n-grams, i.e., "tell me" would not be a feature. Is there a way to specify something like "apply the stop word list only when the n-gram length is 1"?
You have at least 2 options:
Combine two kinds of features with FeatureUnion: one CountVectorizer with ngram_range=(1, 1) that uses the stop word list, and one with ngram_range=(2, 3) that has no stop word list.
(More efficient, but harder to implement and use.) Implement your own analyzer that checks for membership in the stop word list only for unigrams; see, for example, the code sample in this answer.
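Here is a minimal sketch of both options, in case it helps. The documents, the stop word list, and the helper name make_analyzer are made up for illustration; the custom analyzer is just one possible way to write it and assumes CountVectorizer's default lowercasing and token pattern are what you want.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion

# Illustrative data (assumption, not from the question).
stop_words = ["me", "the"]
docs = ["please tell me more", "tell me the story"]

# Option 1: FeatureUnion of two vectorizers.
# Unigrams are filtered by the stop word list; 2- and 3-grams are not.
union = FeatureUnion([
    ("unigrams", CountVectorizer(ngram_range=(1, 1), stop_words=stop_words)),
    ("higher_ngrams", CountVectorizer(ngram_range=(2, 3))),
])
X_union = union.fit_transform(docs)

fitted = dict(union.transformer_list)
union_features = set(fitted["unigrams"].vocabulary_) | set(
    fitted["higher_ngrams"].vocabulary_
)


# Option 2: a custom analyzer that drops stop words only among unigrams.
def make_analyzer(stop_words, ngram_range=(1, 3)):
    """Hypothetical helper: returns a callable usable as `analyzer=`."""
    stop_set = frozenset(stop_words)
    base = CountVectorizer()
    preprocess = base.build_preprocessor()  # default: lowercases the text
    tokenize = base.build_tokenizer()       # default token pattern
    min_n, max_n = ngram_range

    def analyzer(doc):
        tokens = tokenize(preprocess(doc))
        features = []
        for n in range(min_n, max_n + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                # The stop word check applies to unigrams only.
                if n == 1 and gram[0] in stop_set:
                    continue
                features.append(" ".join(gram))
        return features

    return analyzer


custom = CountVectorizer(analyzer=make_analyzer(stop_words))
X_custom = custom.fit_transform(docs)
custom_features = set(custom.vocabulary_)
```

In both cases "tell me" is kept as a feature while "me" on its own is not. The second version builds a single vocabulary in one pass, but note that when analyzer is a callable, CountVectorizer ignores its own stop_words, ngram_range and token_pattern settings, so all of that logic has to live inside the function.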