I used sklearn to calculate TF-IDF (term frequency-inverse document frequency) values for documents, using the following code:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(documents)

tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
X_train_tf is a scipy.sparse matrix of shape (2257, 35788).
How can I get the TF-IDF values for the words in a particular document? More specifically, how can I find the words with the maximum TF-IDF values in a given document?
Two of the more popular text-vectorization schemes are BoW, which stands for Bag of Words, and TF-IDF, which stands for Term Frequency-Inverse Document Frequency.
The formula used to compute the tf-idf for a term t of a document d in a document set is tf-idf(t, d) = tf(t, d) * idf(t), where the idf is computed as idf(t) = log[n / df(t)] + 1 (if smooth_idf=False), n is the total number of documents in the document set, and df(t) is the document frequency of t, i.e. the number of documents in the set that contain the term.
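To make the formula concrete, here is a minimal sketch (my own illustration, not part of the original posts) that computes tf-idf by hand on a hypothetical toy corpus and checks it against TfidfTransformer with smooth_idf=False and norm=None (normalization disabled so the raw products are comparable):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["the cat sat", "the cat ran", "a dog ran"]   # hypothetical toy corpus

counts = CountVectorizer().fit_transform(docs)       # sparse matrix of raw counts tf(t, d)
tf_arr = counts.toarray()
n = tf_arr.shape[0]                                  # total number of documents
df = (tf_arr > 0).sum(axis=0)                        # df(t): number of documents containing each term
idf = np.log(n / df) + 1                             # idf(t) = log(n / df(t)) + 1
manual = tf_arr * idf                                # tf-idf(t, d) = tf(t, d) * idf(t)

sk = TfidfTransformer(smooth_idf=False, norm=None).fit_transform(counts)
print(np.allclose(manual, sk.toarray()))             # True

Note that with sklearn's default norm='l2' each row is additionally normalized, so hand-computed raw products will only match if you disable normalization as above.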
Given the way the TF-IDF score is set up, there shouldn't be a significant difference from removing the stopwords. The whole point of the IDF is to down-weight words that carry little semantic value across the corpus; if you do keep the stopwords in, the IDF should largely neutralize them.
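As a quick illustration of that point (my own sketch, using a hypothetical three-document corpus), you can inspect the fitted idf_ values and see that a word occurring in every document receives the minimum idf:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "the dog ran", "the fish swam"]   # hypothetical corpus
vec = TfidfVectorizer()                                  # stopwords kept (stop_words=None)
vec.fit(docs)

for word in ("the", "cat"):
    idx = vec.vocabulary_[word]
    print(word, vec.idf_[idx])
# "the" appears in every document, so it gets the minimum idf (1.0 here);
# "cat" appears in only one document and gets a noticeably higher idf.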
You can use TfidfVectorizer from sklearn:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from scipy.sparse import csr_matrix  # needed if you want to save tfidf_matrix

# input='filename' means corpus should be a list of file paths, one document per file
tf = TfidfVectorizer(input='filename', analyzer='word', ngram_range=(1, 6),
                     min_df=0, stop_words='english', sublinear_tf=True)
tfidf_matrix = tf.fit_transform(corpus)
The above tfidf_matrix holds the TF-IDF values of all the documents in the corpus. This is a big sparse matrix. Now,
feature_names = tf.get_feature_names_out()  # use tf.get_feature_names() on scikit-learn < 1.0
This gives you the list of all the tokens (words or n-grams). For the first document in your corpus:
doc = 0
feature_index = tfidf_matrix[doc, :].nonzero()[1]
tfidf_scores = zip(feature_index, [tfidf_matrix[doc, x] for x in feature_index])
Let's print them:
for w, s in [(feature_names[i], s) for (i, s) in tfidf_scores]:
    print(w, s)
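To directly answer the second part of the question, the words with the maximum TF-IDF values in a given document, here is a short follow-up sketch (my addition, reusing tfidf_matrix, feature_names, and doc from above, with a hypothetical top_k cutoff):

top_k = 10                                     # hypothetical cutoff

row = tfidf_matrix[doc, :].toarray().ravel()   # dense TF-IDF row for this document
top_indices = row.argsort()[::-1][:top_k]      # indices sorted by descending score
for i in top_indices:
    if row[i] > 0:                             # skip terms absent from the document
        print(feature_names[i], row[i])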