How to use sklearn's CountVectorizer to get n-grams that include any punctuation as separate tokens?

I use sklearn.feature_extraction.text.CountVectorizer to compute n-grams. Example:

import sklearn.feature_extraction.text # FYI http://scikit-learn.org/stable/install.html
ngram_size = 4
string = ["I really like python, it's pretty awesome."]
vect = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(ngram_size,ngram_size))
vect.fit(string)
print('{1}-grams: {0}'.format(vect.get_feature_names(), ngram_size))

outputs:

4-grams: [u'like python it pretty', u'python it pretty awesome', u'really like python it']

The punctuation is removed. How can I include punctuation marks as separate tokens?

224

asked Aug 20 '15 21:08

Franck Dernoncourt

1 Answer

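By default, CountVectorizer tokenizes with the regular expression given by its token_pattern parameter, r"(?u)\b\w\w+\b", which keeps only runs of two or more word characters, so punctuation never reaches the n-gram step. To keep it, pass a word tokenizer that treats punctuation as separate tokens through the tokenizer parameter when creating the sklearn.feature_extraction.text.CountVectorizer instance.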

For example, nltk.tokenize.TreebankWordTokenizer treats most punctuation characters as separate tokens:

import sklearn.feature_extraction.text
from nltk.tokenize import TreebankWordTokenizer

ngram_size = 4
string = ["I really like python, it's pretty awesome."]
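vect = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(ngram_size, ngram_size),
                                                 tokenizer=TreebankWordTokenizer().tokenize)
vect.fit(string)  # learn the n-gram vocabulary before reading the feature names
# On scikit-learn >= 1.0, use get_feature_names_out(); get_feature_names() was removed in 1.2.
print('{1}-grams: {0}'.format(vect.get_feature_names(), ngram_size))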

outputs:

4-grams: [u"'s pretty awesome .", u", it 's pretty", u'i really like python', 
          u"it 's pretty awesome", u'like python , it', u"python , it 's", 
          u'really like python ,']
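
If installing NLTK is not an option, the tokenizer parameter accepts any callable that maps a string to a list of tokens, so a small regex-based tokenizer may be enough. The following is only a sketch under that assumption: punct_tokenize and make_vectorizer are hypothetical helper names, not part of scikit-learn or NLTK, and the regex is a simple stand-in rather than an equivalent of the Treebank rules.

```python
import re

# Hypothetical helper (not from scikit-learn or NLTK): a token is either a run
# of word characters or a single character that is neither word nor whitespace.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def punct_tokenize(text):
    """Split text into words and individual punctuation marks."""
    return _TOKEN_RE.findall(text)


def make_vectorizer(ngram_size):
    """Build a CountVectorizer that tokenizes with punct_tokenize."""
    # Imported lazily so the tokenizer can be tried without scikit-learn installed.
    from sklearn.feature_extraction.text import CountVectorizer
    return CountVectorizer(ngram_range=(ngram_size, ngram_size),
                           tokenizer=punct_tokenize)


tokens = punct_tokenize("I really like python, it's pretty awesome.")
print(tokens)
```

Calling make_vectorizer(4).fit(string) would then build 4-grams over these tokens. Unlike TreebankWordTokenizer, this regex splits "it's" into three tokens (it, ', s) instead of two (it, 's), so the resulting n-grams differ from the output shown above.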

150

answered Nov 15 '22 00:11

Franck Dernoncourt


Tags: python, tokenize, nlp, scikit-learn, n-gram
