I do not want terms of length less than 3 or more than, say, 7. There's a straightforward way of doing this in R, but I am not sure how to do it in Python. I tried this, but it still doesn't work:
from sklearn.feature_extraction.text import CountVectorizer
regex1 = '/^[a-zA-Z]{3,7}$/'
vectorizer = CountVectorizer(analyzer='word', tokenizer=tokenize, stop_words=stopwords, token_pattern=regex1, min_df=2, max_df=0.9, max_features=2000)
vectorizer1 = vectorizer.fit_transform(token_dict.values())
I tried other regexes too:
"^[a-zA-Z]{3,7}$"
r'^[a-zA-Z]{3,7}$'
The CountVectorizer documentation states that the default token_pattern selects tokens of 2 or more alphanumeric characters; if you want something different, pass your own regex. (Note also that token_pattern is only used when tokenizer is None, so the custom tokenizer in your snippet overrides it and makes token_pattern a no-op.)
In your case, add token_pattern = "^[a-zA-Z]{3,7}$" to the options of CountVectorizer.
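To see why the anchored pattern fails in the first place: CountVectorizer compiles token_pattern and applies it to each (preprocessed) document with re.findall, so ^ and $ have to match the entire document string, not individual words. A quick sketch at the regex level:

```python
import re

doc = "elon musk is genius"

# CountVectorizer applies token_pattern via re.findall on each document,
# so anchors force the pattern to match the whole document string:
print(re.findall(r'^[a-zA-Z]{3,7}$', doc))  # [] -- anchored pattern finds nothing
print(re.findall(r'[a-zA-Z]{3,7}', doc))    # ['elon', 'musk', 'genius']
```

Since no document is itself a single 3-7 letter word, the anchored pattern yields no tokens at all.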
Edit
The regex that should be used is [a-zA-Z]{3,7}; the ^ and $ anchors prevent it from matching inside a longer document. See the example below:
from sklearn.feature_extraction.text import CountVectorizer

doc1 = ["Elon Musk is genius", "Are you mad",
        "Constitutional Ammendments in Indian Parliament",
        "Constitutional Ammendments in Indian Assembly",
        "House of Cards", "Indian House"]

regex1 = '[a-zA-Z]{3,7}'
vectorizer = CountVectorizer(analyzer='word', stop_words='english', token_pattern=regex1)
vectorizer1 = vectorizer.fit_transform(doc1)
vectorizer.vocabulary_
Results -
{u'ammendm': 0,
u'assembl': 1,
u'cards': 2,
u'constit': 3,
u'elon': 4,
u'ent': 5,
u'ents': 6,
u'genius': 7,
u'house': 8,
u'indian': 9,
u'mad': 10,
u'musk': 11,
u'parliam': 12,
u'utional': 13}
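One caveat visible in the results above: without word boundaries, [a-zA-Z]{3,7} also matches 7-letter chunks inside longer words, which is where truncated fragments like 'constit', 'utional', and 'ammendm' come from. A sketch (my addition, not part of the original answer) that adds \b anchors so only whole words of 3-7 letters are kept:

```python
from sklearn.feature_extraction.text import CountVectorizer

doc1 = ["Elon Musk is genius", "Are you mad",
        "Constitutional Ammendments in Indian Parliament"]

# \b word boundaries require the whole token to be 3-7 letters,
# so longer words like "Constitutional" are dropped rather than split
# into 7-letter fragments.
vectorizer = CountVectorizer(analyzer='word', stop_words='english',
                             token_pattern=r'\b[a-zA-Z]{3,7}\b')
vectorizer.fit_transform(doc1)
print(sorted(vectorizer.vocabulary_))  # ['elon', 'genius', 'indian', 'mad', 'musk']
```

Whether dropping or truncating over-length words is right depends on your application, but dropping is usually what "terms of length 3 to 7" is taken to mean.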