
Only ignore stop words for ngram_range=1

I am using CountVectorizer from sklearn. I want to provide a list of stop words and apply the vectorizer with an ngram_range of (1, 3).

From what I can tell, if a word (say, "me") is in the list of stop words, it is dropped before the higher-order n-grams are built, so "tell me" would never become a feature. Is there a way to specify something like "apply the stop words only when the n-gram length is 1"?
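The behaviour described above can be reproduced with a small sketch. The sample sentence and the one-word stop list are made up for illustration; the default tokenizer of CountVectorizer is assumed.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical one-document corpus and stop list, for illustration only.
docs = ["tell me more"]
vec = CountVectorizer(stop_words=["me"], ngram_range=(1, 3))
vec.fit(docs)

# Stop words are removed from the token stream before n-grams are formed,
# so "tell me" is missing and "tell more" appears instead.
features = sorted(vec.vocabulary_)
print(features)  # ['more', 'tell', 'tell more']
```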

Natalie Arellano asked May 09 '15 22:05



1 Answer

You have at least two options:

  1. Combine two kinds of features with FeatureUnion: one CountVectorizer with ngram_range=(1, 1) that uses the stop word list, and one with ngram_range=(2, 3) that uses no stop word list.

  2. (More efficient, but harder to implement and use.) Implement your own analyzer that checks the stop word list only for unigrams; see, for example, the code sample in this answer.
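Both options can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the sample documents, the stop list, and the helper name `make_unigram_stop_analyzer` are hypothetical, and the token regex is assumed to mirror CountVectorizer's default token_pattern.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion

# Hypothetical data, for illustration only.
stop_words = ["me"]
docs = ["tell me more", "give me time"]

# Option 1: two vectorizers side by side.
# Unigrams are filtered by the stop list; 2- and 3-grams are left untouched.
union = FeatureUnion([
    ("unigrams", CountVectorizer(ngram_range=(1, 1), stop_words=stop_words)),
    ("higher", CountVectorizer(ngram_range=(2, 3))),
])
X_union = union.fit_transform(docs)

fitted = dict(union.transformer_list)
union_features = set(fitted["unigrams"].vocabulary_) | set(fitted["higher"].vocabulary_)

# Option 2: a custom analyzer that checks the stop list only for unigrams.
# Assumed to match CountVectorizer's default token_pattern.
TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")


def make_unigram_stop_analyzer(stop_words, ngram_range=(1, 3)):
    """Return a callable usable as CountVectorizer(analyzer=...)."""
    stop = frozenset(stop_words)
    min_n, max_n = ngram_range

    def analyzer(doc):
        tokens = TOKEN_RE.findall(doc.lower())
        features = []
        for n in range(min_n, max_n + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                # Skip stop words only when the n-gram is a single token.
                if n == 1 and gram[0] in stop:
                    continue
                features.append(" ".join(gram))
        return features

    return analyzer


custom = CountVectorizer(analyzer=make_unigram_stop_analyzer(stop_words))
X_custom = custom.fit_transform(docs)
custom_features = set(custom.vocabulary_)
```

In this sketch both approaches end up with the same feature set: "me" is excluded as a unigram, while "tell me" and "give me time" are kept. The second variant builds a single vocabulary in one pass, but a callable analyzer bypasses the vectorizer's own ngram_range and stop_words settings, so those must be handled inside the function.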

Nikita Astrakhantsev answered Oct 10 '22 11:10
