scikit-learn TfidfVectorizer meaning?

I was reading about the TfidfVectorizer implementation in scikit-learn, and I don't understand the output of the method. For example:

    new_docs = ['He watches basketball and baseball',
                'Julie likes to play basketball',
                'Jane loves to play baseball']
    new_term_freq_matrix = tfidf_vectorizer.transform(new_docs)
    print tfidf_vectorizer.vocabulary_
    print new_term_freq_matrix.todense()

output:

    {u'me': 8, u'basketball': 1, u'julie': 4, u'baseball': 0, u'likes': 5, u'loves': 7, u'jane': 3, u'linda': 6, u'more': 9, u'than': 10, u'he': 2}
    [[ 0.57735027  0.57735027  0.57735027  0.          0.          0.          0.          0.          0.          0.          0.        ]
     [ 0.          0.68091856  0.          0.          0.51785612  0.51785612  0.          0.          0.          0.          0.        ]
     [ 0.62276601  0.          0.          0.62276601  0.          0.          0.          0.4736296   0.          0.          0.        ]]

What is this (e.g. u'me': 8)?

{u'me': 8, u'basketball': 1, u'julie': 4, u'baseball': 0, u'likes': 5, u'loves': 7, u'jane': 3, u'linda': 6, u'more': 9, u'than': 10, u'he': 2}

Is this a matrix or just a vector? I can't understand what the output is telling me:

    [[ 0.57735027  0.57735027  0.57735027  0.          0.          0.          0.          0.          0.          0.          0.        ]
     [ 0.          0.68091856  0.          0.          0.51785612  0.51785612  0.          0.          0.          0.          0.        ]
     [ 0.62276601  0.          0.          0.62276601  0.          0.          0.          0.4736296   0.          0.          0.        ]]

Could anybody explain these outputs in more detail?

Thanks!

282

asked Sep 17 '14 23:09

anon

2 Answers

TfidfVectorizer transforms text into feature vectors that can be used as input to an estimator.

vocabulary_ is a dictionary that maps each token (word) to its feature index in the matrix; each unique token gets its own feature index.

What is this (e.g. u'me': 8)?

It tells you that the token 'me' is represented as feature number 8 in the output matrix.

Is this a matrix or just a vector?

Each sentence becomes one vector (one row), so the three sentences you entered form a matrix with 3 rows. The numbers (weights) in each vector are the tf-idf scores of the features. For example, 'julie': 4 tells you that every sentence in which 'Julie' appears has a non-zero tf-idf weight at feature index 4. You can see this in the 2nd vector:

[ 0. 0.68091856 0. 0. 0.51785612 0.51785612 0. 0. 0. 0. 0. ]

The 5th element (index 4, counting from zero) is 0.51785612, which is the tf-idf score for 'Julie'. For more information about tf-idf scoring, read here: http://en.wikipedia.org/wiki/Tf%E2%80%93idf
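To make the lookup concrete, here is a small sketch that reuses the vocabulary and matrix values printed in the question as plain Python literals, so it runs without scikit-learn. The names `vocabulary`, `rows`, `column` and `score` are only illustrative.

```python
# Values copied from the output shown in the question.
vocabulary = {'me': 8, 'basketball': 1, 'julie': 4, 'baseball': 0, 'likes': 5,
              'loves': 7, 'jane': 3, 'linda': 6, 'more': 9, 'than': 10, 'he': 2}
rows = [
    [0.57735027, 0.57735027, 0.57735027, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.68091856, 0.0, 0.0, 0.51785612, 0.51785612, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.62276601, 0.0, 0.0, 0.62276601, 0.0, 0.0, 0.0, 0.4736296, 0.0, 0.0, 0.0],
]

# The vocabulary gives the column for a token; the row selects the sentence.
column = vocabulary['julie']   # 4
score = rows[1][column]        # row 1 is 'Julie likes to play basketball'
print(column, score)           # prints: 4 0.51785612

# 'julie' does not occur in the other two sentences, so its weight there is 0.
print(rows[0][column], rows[2][column])   # prints: 0.0 0.0
```

In other words, the dictionary is only an index: it says where to look, and the matrix holds the actual weights.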

110

answered Oct 21 '22 09:10

D Volsky

So the tf-idf vectorizer builds its own vocabulary from the entire set of documents it was fitted on, which is what the first line of the output shows (I have sorted it by index for easier reading):

{u'baseball': 0, u'basketball': 1, u'he': 2, u'jane': 3, u'julie': 4, u'likes': 5, u'linda': 6, u'loves': 7, u'me': 8, u'more': 9, u'than': 10}

Now take one document and look at its tf-idf vector. Document:

He watches basketball and baseball

and its output,

[ 0.57735027 0.57735027 0.57735027 0. 0. 0. 0. 0. 0. 0. 0. ]

lines up, position by position, with the sorted vocabulary,

[baseball basketball he jane julie likes linda loves me more than]

Of the words in our document, only baseball, basketball and he are in the vocabulary that was created ('watches' and 'and' are not, so they are ignored). The document vector therefore has tf-idf values for only these three words, each at its position in the sorted vocabulary.
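The same column-to-word mapping can be sketched with scikit-learn itself. The question never shows the documents the vectorizer was fitted on, so the three `train_docs` sentences below are an assumption, picked because they give the same 11-word vocabulary; the variable names are illustrative as well.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# ASSUMPTION: hypothetical fitting corpus, not shown in the question.
train_docs = [
    'Julie loves me more than Linda loves me',
    'Jane likes me more than Julie loves me',
    'He likes basketball more than baseball',
]
new_docs = [
    'He watches basketball and baseball',
    'Julie likes to play basketball',
    'Jane loves to play baseball',
]

tfidf_vectorizer = TfidfVectorizer()   # default settings
tfidf_vectorizer.fit(train_docs)       # builds vocabulary_ and the idf weights
matrix = tfidf_vectorizer.transform(new_docs).toarray()

# Sort the tokens by their column index so that terms[i] names column i.
vocab = tfidf_vectorizer.vocabulary_
terms = sorted(vocab, key=vocab.get)

# Label every non-zero weight in each row with the word it belongs to.
for doc, row in zip(new_docs, matrix):
    weights = {terms[i]: round(float(w), 4) for i, w in enumerate(row) if w > 0}
    print(doc, '->', weights)
```

Words such as 'watches', 'to' and 'play' never show up in the labelled output, because `transform` only scores tokens that were already in the vocabulary at fit time.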

tf-idf is used for tasks such as classifying documents and ranking results in a search engine. tf: term frequency (how many times a vocabulary word occurs in the document). idf: inverse document frequency (a weight that grows as fewer documents contain the word, so rare words count for more than common ones).
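As a worked example, the second row of the output can be recomputed by hand in plain Python. This is a sketch under two assumptions: the fitting corpus is the hypothetical `train_docs` below (the question does not show it), and the vectorizer uses scikit-learn's default settings, i.e. the smoothed idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation of each row. The `tokenize` helper is a simplified stand-in for the real tokenizer.

```python
import math

# ASSUMPTION: hypothetical fitting corpus, not shown in the question.
train_docs = [
    'Julie loves me more than Linda loves me',
    'Jane likes me more than Julie loves me',
    'He likes basketball more than baseball',
]
new_doc = 'Julie likes to play basketball'

def tokenize(text):
    # Simplified stand-in: lowercase, split on spaces, keep tokens of 2+ characters.
    return [t for t in text.lower().split() if len(t) >= 2]

# Vocabulary: every distinct token of the fitting corpus, sorted alphabetically.
vocab = sorted({t for d in train_docs for t in tokenize(d)})
index = {term: i for i, term in enumerate(vocab)}

# df = number of fitting documents that contain the term; idf is the smoothed variant.
n_docs = len(train_docs)
df = {term: sum(term in tokenize(d) for d in train_docs) for term in vocab}
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

# tf * idf: adding idf once per occurrence; tokens outside the vocabulary are skipped.
row = [0.0] * len(vocab)
for token in tokenize(new_doc):
    if token in index:
        row[index[token]] += idf[token]

# L2-normalise so the row has length 1.
norm = math.sqrt(sum(v * v for v in row))
row = [v / norm for v in row]

print(index['basketball'], round(row[index['basketball']], 8))
print(index['julie'], round(row[index['julie']], 8))
```

Under these assumptions 'basketball' gets a higher weight than 'julie' because it occurs in only one of the three fitting documents, while 'julie' occurs in two.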

answered Oct 21 '22 09:10

Rajesh Mappu
