
scikit-learn TfidfVectorizer meaning?

I was reading about scikit-learn's TfidfVectorizer implementation, and I don't understand what the output of the method is. For example:

new_docs = ['He watches basketball and baseball',
            'Julie likes to play basketball',
            'Jane loves to play baseball']
new_term_freq_matrix = tfidf_vectorizer.transform(new_docs)
print tfidf_vectorizer.vocabulary_
print new_term_freq_matrix.todense()

output:

{u'me': 8, u'basketball': 1, u'julie': 4, u'baseball': 0, u'likes': 5, u'loves': 7, u'jane': 3, u'linda': 6, u'more': 9, u'than': 10, u'he': 2}
[[ 0.57735027  0.57735027  0.57735027  0.          0.          0.          0.          0.          0.          0.          0.        ]
 [ 0.          0.68091856  0.          0.          0.51785612  0.51785612  0.          0.          0.          0.          0.        ]
 [ 0.62276601  0.          0.          0.62276601  0.          0.          0.          0.4736296   0.          0.          0.        ]]

What is this (e.g. u'me': 8)?

{u'me': 8, u'basketball': 1, u'julie': 4, u'baseball': 0, u'likes': 5, u'loves': 7, u'jane': 3, u'linda': 6, u'more': 9, u'than': 10, u'he': 2} 

Is this a matrix or just a vector? I can't understand what the output is telling me:

[[ 0.57735027  0.57735027  0.57735027  0.          0.          0.          0.          0.          0.          0.          0.        ]
 [ 0.          0.68091856  0.          0.          0.51785612  0.51785612  0.          0.          0.          0.          0.        ]
 [ 0.62276601  0.          0.          0.62276601  0.          0.          0.          0.4736296   0.          0.          0.        ]]

Could anybody explain these outputs in more detail?

Thanks!

anon asked Sep 17 '14 23:09



2 Answers

TfidfVectorizer transforms text into feature vectors that can be used as input to an estimator.

vocabulary_ is a dictionary that maps each token (word) to a feature index in the matrix; each unique token gets its own feature index.

What is this (e.g. u'me': 8)?

It tells you that the token 'me' is represented as feature number 8 in the output matrix.
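As a rough illustration of where such a mapping comes from, here is a minimal sketch using scikit-learn. The training corpus below is hypothetical (the asker's fitting code is not shown in the question), so the indices and weights will differ from the output above. It only shows that `vocabulary_` is learned in `fit` and that `transform` ignores words that are not in it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# ASSUMPTION: a made-up training corpus; the asker's real one is not shown.
train_docs = [
    "He likes basketball",
    "Julie loves baseball",
]
new_docs = ["He watches basketball and baseball"]

vectorizer = TfidfVectorizer()
vectorizer.fit(train_docs)  # the vocabulary is learned here, not in transform

# token -> column index; indices follow the alphabetical order of the tokens
vocab = vectorizer.vocabulary_
print(vocab)

# transform returns a sparse matrix of shape (n_documents, n_features)
matrix = vectorizer.transform(new_docs)
dense = matrix.toarray()
print(dense)
```

"watches" and "and" were never seen during fitting, so they contribute nothing; only the columns for "baseball", "basketball" and "he" are non-zero.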

is this a matrix or just a vector?

Each sentence is represented as a vector, and the three sentences you entered form a matrix with 3 rows (one vector per sentence). In each vector, the numbers (weights) are the tf-idf scores of the features. For example, 'julie': 4 tells you that every sentence in which 'Julie' appears has a non-zero tf-idf weight at index 4. You can see this in the 2nd vector:

[ 0. 0.68091856 0. 0. 0.51785612 0.51785612 0. 0. 0. 0. 0. ]

The 5th element (index 4) is 0.51785612, the tf-idf score for 'Julie'. For more information about tf-idf scoring, see: http://en.wikipedia.org/wiki/Tf%E2%80%93idf
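The lookup described above can be done mechanically by inverting the vocabulary. This is a small sketch in plain Python using the vocabulary and the 2nd row copied from the question; the variable names are only illustrative.

```python
# Vocabulary and 2nd row, copied from the output in the question.
vocabulary = {'me': 8, 'basketball': 1, 'julie': 4, 'baseball': 0,
              'likes': 5, 'loves': 7, 'jane': 3, 'linda': 6,
              'more': 9, 'than': 10, 'he': 2}
row = [0., 0.68091856, 0., 0., 0.51785612, 0.51785612, 0., 0., 0., 0., 0.]

# Invert token -> index into index -> token.
index_to_token = {index: token for token, index in vocabulary.items()}

# Keep only the columns that carry a non-zero weight.
labelled = {index_to_token[i]: w for i, w in enumerate(row) if w != 0.0}
print(labelled)
```

The three surviving entries are exactly the vocabulary words that occur in 'Julie likes to play basketball'.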

D Volsky answered Oct 21 '22 09:10


The vectorizer builds its own vocabulary from the entire set of documents it was fitted on, which is what the first line of the output shows (I have sorted it by index for readability):

{u'baseball': 0, u'basketball': 1, u'he': 2, u'jane': 3, u'julie': 4, u'likes': 5, u'linda': 6, u'loves': 7, u'me': 8, u'more': 9, u'than': 10}

Each document is then parsed to compute its tf-idf vector. Take the document:

He watches basketball and baseball

and its output,

[ 0.57735027 0.57735027 0.57735027 0. 0. 0. 0. 0. 0. 0. 0. ]

lines up, position by position, with the sorted vocabulary,

[baseball basketball he jane julie likes linda loves me more than]

Our document contains only three words from the vocabulary (baseball, basketball and he), so the output vector has non-zero tf-idf values only for those three words, each at its position in the sorted vocabulary. Words that are not in the vocabulary, such as 'watches' and 'and', are ignored.

tf-idf is used for tasks such as classifying documents and ranking results in search engines. tf is the term frequency (how many times a vocabulary term occurs in the document). idf is the inverse document frequency (a weight that is higher for terms found in few documents of the collection and lower for terms found in many).
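To make the two factors concrete, here is a sketch in plain Python of the weighting that TfidfVectorizer applies with its default settings: smoothed idf, ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation of each row. The number of training documents and the document frequencies below are assumptions inferred from the numbers in the question, not values the asker stated.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    # Smoothed idf as used by scikit-learn's defaults (smooth_idf=True).
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

def tfidf_row(term_counts, doc_freqs, n_docs):
    # Raw tf * idf weight for each term in the document.
    raw = {t: c * smoothed_idf(n_docs, doc_freqs[t])
           for t, c in term_counts.items()}
    # L2-normalise so the row has unit length (norm='l2' is the default).
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

# ASSUMPTION: 3 training documents; 'basketball' occurs in 1 of them,
# 'julie' and 'likes' in 2 each. These are inferred, not given in the question.
N_DOCS = 3
doc_freqs = {'basketball': 1, 'julie': 2, 'likes': 2}

# 'Julie likes to play basketball': each vocabulary term occurs once.
counts = {'basketball': 1, 'julie': 1, 'likes': 1}
weights = tfidf_row(counts, doc_freqs, N_DOCS)
print(weights)
```

Under those assumptions the result agrees with the 2nd row of the question's output (about 0.6809 for 'basketball' and 0.5179 for 'julie' and 'likes'). The same normalisation explains the 1st row: three terms with equal weight each end up at 1/sqrt(3), which is about 0.57735.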

Rajesh Mappu answered Oct 21 '22 09:10