 

How to see top n entries of term-document matrix after tfidf in scikit-learn

I am new to scikit-learn, and I was using TfidfVectorizer to find the tfidf values of terms in a set of documents. I used the following code:

vectorizer = TfidfVectorizer(stop_words=u'english', ngram_range=(1,5), lowercase=True)
X = vectorizer.fit_transform(lectures)

Now if I print X, I can see all the entries in the matrix, but how can I find the top n entries by tfidf score? In addition, is there any method that will help me find the top n entries by tfidf score per ngram, i.e. the top entries among unigrams, bigrams, trigrams, and so on?

asked Aug 09 '14 by Amrith Krishna

People also ask

What is the difference between TfidfVectorizer and Tfidftransformer?

Tfidftransformer and Tfidfvectorizer aim to do the same thing: convert a collection of raw documents to a matrix of TF-IDF features. The difference is that with Tfidftransformer you compute the word counts, generate the idf values, and then compute the tfidf scores as separate steps, whereas Tfidfvectorizer does all three at once.
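
As a sketch of the contrast (on a toy corpus of my own, not from the question), both routes below should produce the same tfidf matrix:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
import numpy as np

docs = ["this is some food", "this is some drink"]

# Route 1: count first, then re-weight, as two explicit steps
counts = CountVectorizer().fit_transform(docs)
tfidf_two_step = TfidfTransformer().fit_transform(counts)

# Route 2: TfidfVectorizer does both steps at once
tfidf_one_step = TfidfVectorizer().fit_transform(docs)

print(np.allclose(tfidf_two_step.toarray(), tfidf_one_step.toarray()))  # True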

What is the difference between Countvectorizer and TfidfVectorizer?

TfidfVectorizer improves on CountVectorizer because it captures not only the frequency of words in the corpus but also how important each word is. We can then remove the words that are less important for the analysis, which reduces the input dimensions and makes model building less complex.
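
To make the difference concrete, here is a minimal sketch (toy documents of my own) comparing the two vectorizers on the same corpus:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat", "the cat sat on the mat"]

# CountVectorizer: raw term frequencies only
print(CountVectorizer().fit_transform(docs).toarray())

# TfidfVectorizer: frequencies re-weighted so common, uninformative terms score lower
print(TfidfVectorizer().fit_transform(docs).toarray())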

What is Sklearn Feature_extraction?

The sklearn.feature_extraction module can be used to extract features, in a format supported by machine learning algorithms, from datasets consisting of formats such as text and images.
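
For instance, a minimal sketch using another extractor from this module, DictVectorizer, which maps dicts of feature names to a numeric matrix (the example data is my own):

from sklearn.feature_extraction import DictVectorizer

measurements = [{'city': 'Dubai', 'temperature': 33.0},
                {'city': 'London', 'temperature': 12.0}]

# One column per feature name; string-valued features are one-hot encoded
vec = DictVectorizer(sparse=False)
print(vec.fit_transform(measurements))
print(vec.get_feature_names())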

Does TfidfVectorizer do Stemming?

In particular, we pass the TfidfVectorizer our own function that performs custom tokenization and stemming, but we use scikit-learn's built-in stop word removal rather than NLTK's. Then we call fit_transform, which does a few things: first, it creates a dictionary of 'known' words based on the input text given to it.
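
A hedged sketch of that setup, assuming NLTK's PorterStemmer is available (the tokenize_and_stem helper is hypothetical, not from the original):

from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def tokenize_and_stem(text):
    # naive whitespace tokenization followed by stemming
    return [stemmer.stem(token) for token in text.split()]

vectorizer = TfidfVectorizer(tokenizer=tokenize_and_stem, stop_words='english')
X = vectorizer.fit_transform(["the runner was running and ran quickly"])
print(vectorizer.get_feature_names())  # stemmed vocabulary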


1 Answer

Since version 0.15, the global term weighting of the features learnt by a TfidfVectorizer can be accessed through the attribute idf_, which returns an array of length equal to the number of features. Sort the features by this weight to get the top weighted features:

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

lectures = ["this is some food", "this is some drink"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(lectures)

# Sort feature indices by idf weight, highest first
indices = np.argsort(vectorizer.idf_)[::-1]
features = vectorizer.get_feature_names()
top_n = 2
top_features = [features[i] for i in indices[:top_n]]
print top_features

Output:

[u'food', u'drink'] 
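
Note that idf_ is a global weight per term, not the tfidf score of a matrix entry. If you want the top n entries of X itself by tfidf score, a minimal sketch along the same lines (reusing X, features, and top_n from the snippet above):

import numpy as np

# Highest tfidf score each term attains across all documents
scores = X.toarray().max(axis=0)
indices = np.argsort(scores)[::-1]
print([features[i] for i in indices[:top_n]])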

The second task, getting the top features per ngram, can be handled with the same idea, plus an extra step that splits the features into groups by ngram length:

from sklearn.feature_extraction.text import TfidfVectorizer
from collections import defaultdict

lectures = ["this is some food", "this is some drink"]
vectorizer = TfidfVectorizer(ngram_range=(1,2))
X = vectorizer.fit_transform(lectures)

# Group (feature, idf weight) pairs by the number of tokens in the feature
features_by_gram = defaultdict(list)
for f, w in zip(vectorizer.get_feature_names(), vectorizer.idf_):
    features_by_gram[len(f.split(' '))].append((f, w))

# Print the top_n features in each group, sorted by idf weight
# (Python 3: use items() instead of iteritems())
top_n = 2
for gram, features in features_by_gram.iteritems():
    top_features = sorted(features, key=lambda x: x[1], reverse=True)[:top_n]
    top_features = [f[0] for f in top_features]
    print '{}-gram top:'.format(gram), top_features

Output:

1-gram top: [u'drink', u'food'] 2-gram top: [u'some drink', u'some food'] 
answered Sep 30 '22 by YS-L