I use sklearn.feature_extraction.text.CountVectorizer to compute n-grams. Example:
import sklearn.feature_extraction.text # FYI http://scikit-learn.org/stable/install.html
ngram_size = 4
string = ["I really like python, it's pretty awesome."]
vect = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(ngram_size,ngram_size))
vect.fit(string)
print('{1}-grams: {0}'.format(vect.get_feature_names(), ngram_size))
outputs:
4-grams: [u'like python it pretty', u'python it pretty awesome', u'really like python it']
The punctuation is removed: how to include them as separate tokens?
The default tokenization in CountVectorizer removes all special characters, punctuation and single characters. If this is not the behavior you desire, and you want to keep punctuation and special characters, you can provide a custom tokenizer to CountVectorizer.
The CountVectorizer will select the words/features/terms which occur the most frequently. It takes absolute values so if you set the 'max_features = 3', it will select the 3 most common words in the data. By setting 'binary = True', the CountVectorizer no more takes into consideration the frequency of the term/word.
CountVectorizer. Convert a collection of text documents to a matrix of token counts. This implementation produces a sparse representation of the counts using scipy.
CountVectorizer will tokenize the data and split it into chunks called n-grams, of which we can define the length by passing a tuple to the ngram_range argument. For example, 1,1 would give us unigrams or 1-grams such as “whey” and “protein”, while 2,2 would give us bigrams or 2-grams, such as “whey protein”.
You should specify a word tokenizer that considers any punctuation as a separate token when creating the sklearn.feature_extraction.text.CountVectorizer instance, using the tokenizer
parameter.
For example, nltk.tokenize.TreebankWordTokenizer
treats most punctuation characters as separate tokens:
import sklearn.feature_extraction.text
from nltk.tokenize import TreebankWordTokenizer
ngram_size = 4
string = ["I really like python, it's pretty awesome."]
vect = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(ngram_size,ngram_size), \
tokenizer=TreebankWordTokenizer().tokenize)
print('{1}-grams: {0}'.format(vect.get_feature_names(), ngram_size))
outputs:
4-grams: [u"'s pretty awesome .", u", it 's pretty", u'i really like python',
u"it 's pretty awesome", u'like python , it', u"python , it 's",
u'really like python ,']
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With