How do I use sklearn's CountVectorizer with both the 'word' and 'char' analyzers? http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
I can extract text features by word or by char separately, but how do I create a charword_vectorizer? Is there a way to combine the vectorizers, or to use more than one analyzer?
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> word_vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1)
>>> char_vectorizer = CountVectorizer(analyzer='char', ngram_range=(1, 2), min_df=1)
>>> x = ['this is a foo bar', 'you are a foo bar black sheep']
>>> word_vectorizer.fit_transform(x)
<2x15 sparse matrix of type '<type 'numpy.int64'>'
with 18 stored elements in Compressed Sparse Column format>
>>> char_vectorizer.fit_transform(x)
<2x47 sparse matrix of type '<type 'numpy.int64'>'
with 64 stored elements in Compressed Sparse Column format>
>>> char_vectorizer.get_feature_names()
[u' ', u' a', u' b', u' f', u' i', u' s', u'a', u'a ', u'ac', u'ar', u'b', u'ba', u'bl', u'c', u'ck', u'e', u'e ', u'ee', u'ep', u'f', u'fo', u'h', u'he', u'hi', u'i', u'is', u'k', u'k ', u'l', u'la', u'o', u'o ', u'oo', u'ou', u'p', u'r', u'r ', u're', u's', u's ', u'sh', u't', u'th', u'u', u'u ', u'y', u'yo']
>>> word_vectorizer.get_feature_names()
[u'are', u'are foo', u'bar', u'bar black', u'black', u'black sheep', u'foo', u'foo bar', u'is', u'is foo', u'sheep', u'this', u'this is', u'you', u'you are']
You can pass a callable as the analyzer argument to get full control over the tokenization, e.g.
>>> from pprint import pprint
>>> import re
>>> x = ['this is a foo bar', 'you are a foo bar black sheep']
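>>> def words_and_char_bigrams(text):
...     # keep only words of three or more characters
...     words = re.findall(r'\w{3,}', text)
...     for w in words:
...         yield w
...         # range(len(w) - 2) stops before the word's last bigram;
...         # use range(len(w) - 1) to include it
...         for i in range(len(w) - 2):
...             yield w[i:i+2]
...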
>>> v = CountVectorizer(analyzer=words_and_char_bigrams)
>>> pprint(v.fit(x).vocabulary_)
{'ac': 0,
'ar': 1,
'are': 2,
'ba': 3,
'bar': 4,
'bl': 5,
'black': 6,
'ee': 7,
'fo': 8,
'foo': 9,
'he': 10,
'hi': 11,
'la': 12,
'sh': 13,
'sheep': 14,
'th': 15,
'this': 16,
'yo': 17,
'you': 18}
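As the vocabulary above shows, the posted loop leaves out the final bigram of each word (for example 'oo' from 'foo'). The following is a minimal sketch of a variant that emits every bigram, assuming that is what you want; the name words_and_all_char_bigrams is hypothetical and not part of scikit-learn. It uses only the standard library so the token stream can be inspected directly:

```python
import re
from collections import Counter


def words_and_all_char_bigrams(text):
    """Yield each word of 3+ characters, followed by all of its character bigrams."""
    for w in re.findall(r'\w{3,}', text):
        yield w
        # len(w) - 1 start positions cover every bigram, including the last one
        for i in range(len(w) - 1):
            yield w[i:i + 2]


tokens = list(words_and_all_char_bigrams('this is a foo bar'))
counts = Counter(tokens)  # per-document counts, similar in spirit to one row of the count matrix
print(tokens)
```

The same generator can be passed as CountVectorizer(analyzer=words_and_all_char_bigrams). Keep in mind that a callable analyzer receives the raw input, so any lowercasing or accent stripping has to happen inside the function.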
You can combine arbitrary feature extraction steps with the FeatureUnion estimator: http://scikit-learn.org/dev/modules/pipeline.html#featureunion-combining-feature-extractors
In this case it is probably less efficient than larsmans' solution, but it might be easier to use.
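A rough sketch of the FeatureUnion approach, reusing the two vectorizers and the sample documents from the question. The step names 'word' and 'char' and the variable name charword_vectorizer are arbitrary choices for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion

x = ['this is a foo bar', 'you are a foo bar black sheep']

# Same settings as the two vectorizers in the question
word_vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1)
char_vectorizer = CountVectorizer(analyzer='char', ngram_range=(1, 2), min_df=1)

# FeatureUnion fits each transformer on the same input and
# concatenates their output matrices column-wise
charword_vectorizer = FeatureUnion([
    ('word', word_vectorizer),
    ('char', char_vectorizer),
])
combined = charword_vectorizer.fit_transform(x)

# Look up the fitted vectorizers by the step names given above
fitted = dict(charword_vectorizer.transformer_list)
n_word = len(fitted['word'].vocabulary_)
n_char = len(fitted['char'].vocabulary_)
print(combined.shape, n_word, n_char)
```

The combined matrix has one row per document and as many columns as the word and char vocabularies together, with the word features first because that step is listed first.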