I am new to scikit-learn, and I was using TfidfVectorizer to find the tf-idf values of terms in a set of documents. I used the following code:
vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 5), lowercase=True)
X = vectorizer.fit_transform(lectures)
Now if I print X, I can see all the entries in the matrix, but how can I find the top n entries by tf-idf score? In addition, is there any method that will help me find the top n entries by tf-idf score per n-gram, i.e. the top entries among unigrams, bigrams, trigrams, and so on?
TfidfTransformer and TfidfVectorizer aim to do the same thing, which is to convert a collection of raw documents to a matrix of TF-IDF features. The difference is that with TfidfTransformer you compute the word counts separately (typically with CountVectorizer), then generate the idf values, and only then compute the tf-idf scores; TfidfVectorizer performs all three steps at once.
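The two routes produce the same matrix. Here is a minimal sketch of the equivalence, assuming default parameters (the two-document toy corpus is made up):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

docs = ["this is some food", "this is some drink"]

# Two-step route: raw counts first, then tf-idf weighting.
counts = CountVectorizer().fit_transform(docs)
tfidf_two_step = TfidfTransformer().fit_transform(counts)

# One-step route: TfidfVectorizer does counting and weighting internally.
tfidf_one_step = TfidfVectorizer().fit_transform(docs)

print(np.allclose(tfidf_two_step.toarray(), tfidf_one_step.toarray()))  # True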
TF-IDF is better than plain count vectors because it not only reflects how frequently words occur in the corpus but also captures how important each word is. We can then remove the words that are less important for analysis, making model building less complex by reducing the input dimensionality.
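For instance, a small sketch on the same made-up corpus shows how the words that appear in every document receive the lowest idf weight, which is one way to spot the "less important" terms:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["this is some food", "this is some drink"]
vec = TfidfVectorizer().fit(docs)
# Words present in every document ('this', 'is', 'some') get the minimum
# idf weight; words unique to one document ('food', 'drink') score higher.
for word, idx in sorted(vec.vocabulary_.items()):
    print(word, vec.idf_[idx])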
The sklearn.feature_extraction module can be used to extract features, in a format supported by machine learning algorithms, from datasets consisting of formats such as text and images.
In particular, we can pass the TfidfVectorizer our own function that performs custom tokenization and stemming, while using scikit-learn's built-in stop word removal rather than NLTK's. Then we call fit_transform, which does a few things: first, it creates a dictionary of 'known' words based on the input text given to it.
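As a sketch of that pattern (my_tokenizer below is a hypothetical stand-in for a real tokenizer/stemmer, and its suffix stripping is deliberately crude):

from sklearn.feature_extraction.text import TfidfVectorizer

def my_tokenizer(text):
    # Hypothetical tokenizer: whitespace split plus naive plural stripping.
    return [w.rstrip('s') for w in text.split()]

vec = TfidfVectorizer(tokenizer=my_tokenizer, stop_words='english')
X = vec.fit_transform(["cats eat foods", "dogs eat foods"])
# fit_transform built the dictionary of 'known' words (the vocabulary):
print(sorted(vec.vocabulary_.items()))  # [('cat', 0), ('dog', 1), ('eat', 2), ('food', 3)]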
Since version 0.15, the global term weighting of the features learnt by a TfidfVectorizer can be accessed through the attribute idf_, which returns an array of length equal to the feature dimension. Sort the features by this weighting to get the top weighted features:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

lectures = ["this is some food", "this is some drink"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(lectures)

# Sort feature indices by idf weight, highest first.
indices = np.argsort(vectorizer.idf_)[::-1]
features = vectorizer.get_feature_names_out()
top_n = 2
top_features = [features[i] for i in indices[:top_n]]
print(top_features)
Output:
['food', 'drink']
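Note that idf_ is a corpus-level weight rather than a per-document tf-idf score. If you instead want the top entries by the actual tf-idf values stored in X, one possibility (a sketch, not part of the original answer) is to rank each term by the maximum score it reaches in any document:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

lectures = ["this is some food", "this is some drink"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(lectures)
features = vectorizer.get_feature_names_out()

# Highest tf-idf value each term attains across all documents.
scores = X.max(axis=0).toarray().ravel()
for i in np.argsort(scores)[::-1][:2]:
    print(features[i], scores[i])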
The second problem, getting the top features per n-gram, can be solved with the same idea, plus an extra step of splitting the features into groups by n-gram length:
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import defaultdict

lectures = ["this is some food", "this is some drink"]
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(lectures)

# Group (feature, idf weight) pairs by the number of words in the feature.
features_by_gram = defaultdict(list)
for f, w in zip(vectorizer.get_feature_names_out(), vectorizer.idf_):
    features_by_gram[len(f.split(' '))].append((f, w))

top_n = 2
for gram, features in features_by_gram.items():
    top_features = sorted(features, key=lambda x: x[1], reverse=True)[:top_n]
    top_features = [f[0] for f in top_features]
    print('{}-gram top:'.format(gram), top_features)
Output:
1-gram top: ['drink', 'food']
2-gram top: ['some drink', 'some food']