I have installed Python 2.7, NumPy 1.9.0, SciPy 0.15.1 and scikit-learn 0.15.2. When I run the following in Python:
train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun in the sky is bright.",
"We can see the shining sun, the bright sun.")
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
print vectorizer
CountVectorizer(analyzer=u'word', binary=False, charset=None,
charset_error=None, decode_error=u'strict',
dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
ngram_range=(1, 1), preprocessor=None, stop_words=None,
strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
tokenizer=None, vocabulary=None)
vectorizer.fit_transform(train_set)
print vectorizer.vocabulary
None.
Actually it should have printed the following:
CountVectorizer(analyzer__min_n=1,
analyzer__stop_words=set(['all', 'six', 'less', 'being', 'indeed', 'over',
'move', 'anyway', 'four', 'not', 'own', 'through', 'yourselves', (...) ---> for the CountVectorizer
{'blue': 0, 'sun': 1, 'bright': 2, 'sky': 3} ---> for vocabulary
The above code is from the blog: http://blog.christianperone.com/?p=1589
Could you please help me understand why I get this result? Since the vocabulary is not indexed properly, I am not able to move ahead in understanding the concept of TF-IDF. I am new to Python, so any help would be appreciated.
Arc.
If you just want the vocabulary without the position of each word in the sparse matrix, you can use the get_feature_names() method (newer scikit-learn releases replace it with get_feature_names_out()). The names it returns can also serve as column labels for the matrix. CountVectorizer is just one of the ways to deal with textual data.
Stop words are very common words such as a, an, the, is, has, of and are, which carry little meaning on their own. Removing them helps build a cleaner dataset with better features for a machine learning model. Instantiating CountVectorizer with the stop_words parameter (for example stop_words='english') tells it to remove stop words.
CountVectorizer creates a matrix of documents and token counts (a bag of terms/tokens), which is why the result is also known as a document-term matrix (DTM).
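To make the idea concrete, here is a minimal pure-Python sketch of how such a matrix can be built from the train_set used in the question. It is not scikit-learn's implementation: the tiny STOP_WORDS set and the build_dtm helper are made up for illustration, and the library's built-in 'english' stop list is much longer.

```python
import re
from collections import Counter

# Hypothetical, tiny stop list for illustration only; scikit-learn's
# built-in 'english' list contains many more words.
STOP_WORDS = {"the", "is"}

# Same pattern as the default token_pattern printed in the question:
# runs of two or more word characters.
TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")


def build_dtm(documents):
    """Return (feature_names, rows) for a sequence of text documents."""
    counts = []
    for doc in documents:
        tokens = [t for t in TOKEN_RE.findall(doc.lower())
                  if t not in STOP_WORDS]
        counts.append(Counter(tokens))
    # One column per distinct term, in alphabetical order.
    feature_names = sorted(set().union(*counts))
    # One row per document; a term missing from a document counts as 0.
    rows = [[c[name] for name in feature_names] for c in counts]
    return feature_names, rows


train_set = ("The sky is blue.", "The sun is bright.")
feature_names, dtm = build_dtm(train_set)
print(feature_names)
for row in dtm:
    print(row)
```

Each row is one document and each column one term, so the first row records one occurrence each of 'blue' and 'sky' and none of 'bright' or 'sun'.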
The default tokenization in CountVectorizer drops special characters and punctuation and ignores single-character tokens, because the default token_pattern only matches runs of two or more word characters. If this is not the behavior you want and you need to keep punctuation and special characters, you can provide a custom tokenizer to CountVectorizer.
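The effect of the default pattern can be checked with the standard re module alone. The sketch below applies the token_pattern shown in the question's printed CountVectorizer to a sample sentence, next to a hypothetical alternative pattern (KEEP_ALL_PATTERN is an illustration, not something the library ships) that keeps single characters and punctuation.

```python
import re

# Default token_pattern as printed in the question's CountVectorizer repr.
DEFAULT_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Hypothetical alternative: keeps single-character words and emits each
# punctuation mark as its own token.
KEEP_ALL_PATTERN = re.compile(r"(?u)\w+|[^\w\s]")

text = "I am a newbie, so any help would be appreciated!"

# CountVectorizer lowercases by default (lowercase=True), so do the same.
default_tokens = DEFAULT_PATTERN.findall(text.lower())
keep_all_tokens = KEEP_ALL_PATTERN.findall(text.lower())

print(default_tokens)   # "i", "a", "," and "!" are missing
print(keep_all_tokens)  # every word and punctuation mark is kept
```

Assuming the second behavior is what you want, a pattern like this could be passed through the token_pattern parameter, or wrapped in a function and passed as tokenizer; both parameters appear in the printed CountVectorizer above.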
You are missing an underscore: the fitted vocabulary is stored in vocabulary_, not vocabulary. Try it this way:
from sklearn.feature_extraction.text import CountVectorizer
train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun in the sky is bright.",
"We can see the shining sun, the bright sun.")
vectorizer = CountVectorizer(stop_words='english')
document_term_matrix = vectorizer.fit_transform(train_set)
print vectorizer.vocabulary_
# {u'blue': 0, u'sun': 3, u'bright': 1, u'sky': 2}
If you use the IPython shell, you can use tab completion to find the methods and attributes of objects more easily.
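As for why the underscore matters: in scikit-learn, attributes without a trailing underscore hold the parameters passed to the constructor, while attributes with one hold what was learned during fitting. The toy class below mimics that convention. TinyVectorizer is a made-up stand-in with deliberately naive tokenization, not scikit-learn code.

```python
class TinyVectorizer(object):
    """Made-up stand-in that mimics scikit-learn's naming convention."""

    def __init__(self, vocabulary=None):
        # Constructor parameter: stored exactly as the caller passed it.
        self.vocabulary = vocabulary

    def fit(self, documents):
        # Learned attribute: only exists after fitting, hence the
        # trailing underscore. Tokenization here is deliberately naive.
        words = set()
        for doc in documents:
            words.update(doc.lower().replace(".", "").split())
        self.vocabulary_ = {word: i for i, word in enumerate(sorted(words))}
        return self


vectorizer = TinyVectorizer().fit(("The sky is blue.", "The sun is bright."))
print(vectorizer.vocabulary)   # the untouched parameter: None
print(vectorizer.vocabulary_)  # the mapping learned in fit
```

This is the same situation as in the question: vocabulary was never passed to the constructor, so it is still None after fitting.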
Try using the vectorizer.get_feature_names() method. It gives the column names in the order they appear in the document_term_matrix.
from sklearn.feature_extraction.text import CountVectorizer
train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun in the sky is bright.",
"We can see the shining sun, the bright sun.")
vectorizer = CountVectorizer(stop_words='english')
document_term_matrix = vectorizer.fit_transform(train_set)
vectorizer.get_feature_names()
#> ['blue', 'bright', 'sky', 'sun']
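The test_set defined above is never actually used. As a rough sketch of what counting it against the learned columns looks like, the snippet below hard-codes the feature names from the output above and counts only those words; anything outside the vocabulary is ignored. It is plain Python for illustration (count_against is a hypothetical helper), not the library's transform() implementation.

```python
import re
from collections import Counter

# Column names copied from the get_feature_names() output above.
feature_names = ['blue', 'bright', 'sky', 'sun']

test_set = ("The sun in the sky is bright.",
            "We can see the shining sun, the bright sun.")


def count_against(feature_names, documents):
    """Count occurrences of each known feature name in every document."""
    rows = []
    for doc in documents:
        counts = Counter(re.findall(r"(?u)\b\w\w+\b", doc.lower()))
        # Words that are not in feature_names never get a column.
        rows.append([counts[name] for name in feature_names])
    return rows


test_rows = count_against(feature_names, test_set)
for row in test_rows:
    print(row)
```

The second document contains 'sun' twice, so its last column is 2, while 'shining' and 'see' do not appear anywhere because they were not in the training vocabulary.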