CountVectorizer does not print vocabulary

I have installed Python 2.7, NumPy 1.9.0, SciPy 0.15.1 and scikit-learn 0.15.2. When I run the following in Python:

train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun in the sky is bright.",
"We can see the shining sun, the bright sun.")

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

print vectorizer


    CountVectorizer(analyzer=u'word', binary=False, charset=None,
    charset_error=None, decode_error=u'strict',
    dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
    lowercase=True, max_df=1.0, max_features=None, min_df=1,
    ngram_range=(1, 1), preprocessor=None, stop_words=None,
    strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
    tokenizer=None, vocabulary=None)

vectorizer.fit_transform(train_set)
print vectorizer.vocabulary

    None

It should have printed the following.

For the CountVectorizer:

CountVectorizer(analyzer__min_n=1,
analyzer__stop_words=set(['all', 'six', 'less', 'being', 'indeed', 'over',
 'move', 'anyway', 'four', 'not', 'own', 'through', 'yourselves', (...)

For the vocabulary:

{'blue': 0, 'sun': 1, 'bright': 2, 'sky': 3}

The above code is from the blog: http://blog.christianperone.com/?p=1589

Could you please help me understand why I get None instead of the vocabulary? Since the vocabulary is not indexed properly, I am not able to move ahead in understanding the concept of TF-IDF. I am new to Python, so any help would be appreciated.

Arc.

Archana asked Mar 06 '15 08:03




2 Answers

You are missing a trailing underscore: the fitted vocabulary is stored in vocabulary_, not vocabulary. Try it this way:

from sklearn.feature_extraction.text import CountVectorizer
train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun in the sky is bright.", 
    "We can see the shining sun, the bright sun.")

vectorizer = CountVectorizer(stop_words='english')
document_term_matrix = vectorizer.fit_transform(train_set)
print vectorizer.vocabulary_
# {u'blue': 0, u'sun': 3, u'bright': 1, u'sky': 2}

If you use the IPython shell, tab completion makes it easier to find the methods and attributes of an object.
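
To see why the attribute without the underscore prints None, here is a minimal sketch of the naming convention. TinyCountVectorizer is a hypothetical stand-in written for this illustration, not scikit-learn's implementation. It assumes the default token pattern shown in the question's output, lowercasing, and alphabetical index assignment, and it is written for Python 3.

```python
import re


class TinyCountVectorizer:
    """Hypothetical stand-in for illustration only; not scikit-learn's code."""

    def __init__(self, vocabulary=None):
        # Constructor parameter: stored as given, so it stays None unless
        # the caller passes a fixed vocabulary. (This toy class does not
        # use a caller-supplied vocabulary during fit.)
        self.vocabulary = vocabulary

    def fit(self, documents):
        # Default pattern from the question's output: tokens of 2+ word characters.
        token_pattern = re.compile(r"(?u)\b\w\w+\b")
        terms = set()
        for doc in documents:
            terms.update(token_pattern.findall(doc.lower()))
        # Learned state gets a trailing underscore: term -> column index,
        # assigned in sorted (alphabetical) order.
        self.vocabulary_ = {term: i for i, term in enumerate(sorted(terms))}
        return self


train_set = ("The sky is blue.", "The sun is bright.")
vectorizer = TinyCountVectorizer().fit(train_set)

print(vectorizer.vocabulary)   # None: the untouched constructor parameter
print(vectorizer.vocabulary_)  # {'blue': 0, 'bright': 1, 'is': 2, 'sky': 3, 'sun': 4, 'the': 5}
```

No stop words are removed in this sketch, which is why "is" and "the" show up here but not in the output above, where stop_words='english' is set.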

Balint Domokos answered Nov 02 '22 23:11


Try the vectorizer.get_feature_names() method. It returns the column names in the order they appear in document_term_matrix.

from sklearn.feature_extraction.text import CountVectorizer
train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun in the sky is bright.", 
    "We can see the shining sun, the bright sun.")

vectorizer = CountVectorizer(stop_words='english')
document_term_matrix = vectorizer.fit_transform(train_set)
vectorizer.get_feature_names()
#> ['blue', 'bright', 'sky', 'sun']
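
The feature names and the vocabulary mapping carry the same information, which the following sketch tries to show in plain Python 3. It starts from the vocabulary printed in the first answer, recovers the column order from it, and counts the test_set documents against those columns. count_row is a hypothetical helper that only approximates the default tokenization (lowercasing plus the token pattern from the question's output). Note also that, as far as I know, scikit-learn 1.0 and later name this method get_feature_names_out(), and get_feature_names() has since been removed.

```python
import re

# Vocabulary as printed in the first answer (term -> column index).
vocabulary = {"blue": 0, "sun": 3, "bright": 1, "sky": 2}

# Column names in column order: sort the terms by their index.
feature_names = sorted(vocabulary, key=vocabulary.get)
print(feature_names)  # ['blue', 'bright', 'sky', 'sun']


def count_row(document, vocabulary):
    """Count the in-vocabulary tokens of one document (hypothetical helper)."""
    row = [0] * len(vocabulary)
    for token in re.findall(r"(?u)\b\w\w+\b", document.lower()):
        if token in vocabulary:
            row[vocabulary[token]] += 1
    return row


test_set = ("The sun in the sky is bright.",
            "We can see the shining sun, the bright sun.")

rows = [count_row(doc, vocabulary) for doc in test_set]
print(rows)  # [[0, 1, 1, 1], [0, 1, 0, 2]]
```

Each row is one document and each column is one term from feature_names, so the second row says that "sun" occurs twice and "bright" once. Words outside the vocabulary, such as "shining", are ignored.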

Selva answered Nov 03 '22 00:11