
How to use sklearn's CountVectorizer() to get n-grams that include punctuation as separate tokens?

I use sklearn.feature_extraction.text.CountVectorizer to compute n-grams. Example:

import sklearn.feature_extraction.text # FYI http://scikit-learn.org/stable/install.html
ngram_size = 4
string = ["I really like python, it's pretty awesome."]
vect = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(ngram_size,ngram_size))
vect.fit(string)
print('{1}-grams: {0}'.format(vect.get_feature_names(), ngram_size))

outputs:

4-grams: [u'like python it pretty', u'python it pretty awesome', u'really like python it']

The punctuation is removed: how can I include punctuation marks as separate tokens?

asked Aug 20 '15 by Franck Dernoncourt

People also ask

Does CountVectorizer remove punctuation?

The default tokenization in CountVectorizer removes all special characters, punctuation and single characters. If this is not the behavior you desire, and you want to keep punctuation and special characters, you can provide a custom tokenizer to CountVectorizer.
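A minimal sketch of both behaviours (the example document is made up; get_feature_names() is the older API name used elsewhere on this page, renamed get_feature_names_out() in newer scikit-learn):

import re
import sklearn.feature_extraction.text

docs = ["Hello, world!"]

# Default tokenization: punctuation and single characters are dropped.
default_vect = sklearn.feature_extraction.text.CountVectorizer()
default_vect.fit(docs)
print(default_vect.get_feature_names())  # ['hello', 'world']

# A custom tokenizer that keeps each punctuation mark as its own token.
def tokenize_keep_punct(text):
    return re.findall(r"\w+|[^\w\s]", text)

punct_vect = sklearn.feature_extraction.text.CountVectorizer(tokenizer=tokenize_keep_punct)
punct_vect.fit(docs)
print(punct_vect.get_feature_names())  # ['!', ',', 'hello', 'world']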

How do you use CountVectorizer?

CountVectorizer selects the words/features/terms that occur most frequently. It uses absolute counts, so setting max_features=3 keeps the 3 most common words in the data. Setting binary=True makes CountVectorizer record only the presence or absence of a term rather than its frequency.
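A short sketch of both parameters on a made-up three-document corpus:

import sklearn.feature_extraction.text

docs = ["the cat sat on the mat",
        "the cat ate the fish",
        "the dog sat"]

# max_features=3 keeps only the 3 terms with the highest corpus-wide counts.
top3_vect = sklearn.feature_extraction.text.CountVectorizer(max_features=3)
top3_vect.fit(docs)
print(top3_vect.get_feature_names())  # ['cat', 'sat', 'the']

# binary=True records presence/absence (0 or 1) instead of raw counts.
binary_vect = sklearn.feature_extraction.text.CountVectorizer(binary=True)
print(binary_vect.fit_transform(docs).toarray()[0])  # 'the' occurs twice but counts as 1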

What is the function of the submodule feature_extraction.text.CountVectorizer?

CountVectorizer converts a collection of text documents to a matrix of token counts. The implementation produces a sparse representation of the counts using scipy.sparse.
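For instance (made-up corpus; .toarray() densifies the sparse matrix only for display):

import sklearn.feature_extraction.text

docs = ["red red blue", "blue green"]
vect = sklearn.feature_extraction.text.CountVectorizer()
counts = vect.fit_transform(docs)   # scipy.sparse matrix, shape (n_docs, n_terms)

print(vect.get_feature_names())     # ['blue', 'green', 'red']
print(counts.toarray())             # [[1 0 2]
                                    #  [1 1 0]]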

What is an n-gram in CountVectorizer?

CountVectorizer tokenizes the data and splits it into chunks called n-grams, whose length we can define by passing a tuple to the ngram_range argument. For example, (1, 1) would give us unigrams (1-grams) such as “whey” and “protein”, while (2, 2) would give us bigrams (2-grams) such as “whey protein”.
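A sketch using the “whey protein” example from the text above (the single-document corpus is made up):

import sklearn.feature_extraction.text

docs = ["whey protein isolate"]

# ngram_range=(1, 1): unigrams only.
unigrams = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(1, 1))
unigrams.fit(docs)
print(unigrams.get_feature_names())  # ['isolate', 'protein', 'whey']

# ngram_range=(2, 2): bigrams only.
bigrams = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(2, 2))
bigrams.fit(docs)
print(bigrams.get_feature_names())   # ['protein isolate', 'whey protein']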


1 Answer

You should specify a word tokenizer that considers any punctuation as a separate token when creating the sklearn.feature_extraction.text.CountVectorizer instance, using the tokenizer parameter.

For example, nltk.tokenize.TreebankWordTokenizer treats most punctuation characters as separate tokens:

import sklearn.feature_extraction.text
from nltk.tokenize import TreebankWordTokenizer

ngram_size = 4
string = ["I really like python, it's pretty awesome."]
vect = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(ngram_size, ngram_size),
                                                       tokenizer=TreebankWordTokenizer().tokenize)
vect.fit(string)
print('{1}-grams: {0}'.format(vect.get_feature_names(), ngram_size))

outputs:

4-grams: [u"'s pretty awesome .", u", it 's pretty", u'i really like python', 
          u"it 's pretty awesome", u'like python , it', u"python , it 's", 
          u'really like python ,']
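
Note that TreebankWordTokenizer splits the contraction it's into it and 's, which is why 's appears as a token above. If you would rather avoid the NLTK dependency, a regex passed via the token_pattern parameter is a possible alternative, though it splits it's differently (into it, ' and s):

import sklearn.feature_extraction.text

ngram_size = 4
string = ["I really like python, it's pretty awesome."]

# token_pattern is applied with re.findall: match either a run of word
# characters or a single character that is neither word nor whitespace.
vect = sklearn.feature_extraction.text.CountVectorizer(
    ngram_range=(ngram_size, ngram_size),
    token_pattern=r"\w+|[^\w\s]")
vect.fit(string)
print('{1}-grams: {0}'.format(vect.get_feature_names(), ngram_size))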
answered Nov 15 '22 by Franck Dernoncourt