I'm trying to use scikit-learn's CountVectorizer to count character 2-grams, ignoring spaces. The docs for the analyzer parameter state:
Whether the feature should be made of word or character n-grams. Option ‘char_wb’ creates character n-grams only from text inside word boundaries.
However, "char_wb" doesn't appear to work as I expected. For example:
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
"The blue dog Blue",
"Green the green cat",
"The green mouse",
]
# CountVectorizer character 2-grams with word boundaries
vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(2, 2), min_df=1)
X = vectorizer.fit_transform(corpus)
vectorizer.get_feature_names()
[' b',
' c',
' d',
' g',
' m',
' t',
'at',
'bl',
'ca', ....
Notice features like ' b', which include a space. What gives?
The full documentation for that option actually explains the space:
Option 'char_wb' creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.
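To see why that produces features like ' b', here is a minimal sketch of the padding idea (a simplified illustration, not scikit-learn's actual implementation): each word is surrounded by one space on each side before the 2-grams are extracted.

def char_wb_bigrams(text):
    # Simplified sketch of the 'char_wb' idea: lowercase the text,
    # pad each word with a single space on both sides, then slide
    # a 2-character window over each padded word.
    ngrams = set()
    for word in text.lower().split():
        padded = f" {word} "
        ngrams.update(padded[i:i + 2] for i in range(len(padded) - 1))
    return sorted(ngrams)

print(char_wb_bigrams("The blue dog Blue"))
# [' b', ' d', ' t', 'bl', 'do', 'e ', 'g ', 'he', 'lu', 'og', 'th', 'ue']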
CountVectorizer is a tool provided by the scikit-learn library in Python. It transforms a given text into a vector based on the frequency (count) of each word that occurs in the entire text.
CountVectorizer selects the words/features/terms that occur most frequently. It uses absolute counts, so if you set max_features=3 it selects the 3 most common words in the data. Setting binary=True makes CountVectorizer record only the presence or absence of each term rather than its frequency.
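For instance, a quick sketch of those two parameters on the question's corpus (using get_feature_names_out, the replacement for get_feature_names in recent scikit-learn releases):

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "The blue dog Blue",
    "Green the green cat",
    "The green mouse",
]

# max_features=3 keeps only the 3 most frequent terms in the corpus.
top3 = CountVectorizer(max_features=3)
top3.fit(corpus)
print(top3.get_feature_names_out())  # ['blue' 'green' 'the']

# binary=True records presence/absence instead of raw counts.
binary = CountVectorizer(binary=True)
X = binary.fit_transform(corpus)
print(X.toarray().max())  # 1 -- every nonzero count is clipped to 1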
I think this is a longstanding inaccuracy in the documentation, which you are welcome to help fix. It would be more correct to say that:
Option ‘char_wb’ creates character n-grams, but does not generate n-grams that cross word boundaries.
The change appears to have been made in this commit to ensure that; see the contributor's comment. The result looks particularly awkward when you compare the bigram output to that of analyzer='char', but when you increase to trigrams you will see that whitespace can begin or end an n-gram but cannot appear in the middle. This helps signify the word-initial or word-final nature of a feature without capturing noisy cross-word character n-grams. It also ensures that, unlike prior to that commit, all extracted n-grams have the same length!
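A minimal sketch of that comparison, reusing the corpus from the question (again with get_feature_names_out, the current spelling of get_feature_names; the exact output may vary slightly across versions):

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "The blue dog Blue",
    "Green the green cat",
    "The green mouse",
]

def trigrams(analyzer):
    vec = CountVectorizer(analyzer=analyzer, ngram_range=(3, 3))
    vec.fit(corpus)
    return set(vec.get_feature_names_out())

# Trigrams produced by 'char' but not by 'char_wb' are exactly the
# cross-word ones, which have a space in the middle position.
cross_word = trigrams("char") - trigrams("char_wb")
print(sorted(cross_word))
# e.g. ['e b', 'e d', 'e g', 'e m', 'g b', 'n c', 'n t']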