I'm using Python and I want to get the TF-IDF representation for a large corpus of documents. I'm using the following code to convert the docs into their TF-IDF form.
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(
    min_df=1,                 # keep terms appearing in at least 1 document
    max_features=4000,        # cap the vocabulary at 4000 features
    strip_accents='unicode',  # replace accented unicode chars
                              # by their unaccented counterparts
    analyzer='word',          # features made of words
    token_pattern=r'\w{1,}',  # tokens are runs of 1+ word characters
                              # (use r'\w{4,}' for words of 4+ chars)
    ngram_range=(1, 1),       # features made of single tokens (unigrams)
    use_idf=True,             # enable inverse-document-frequency reweighting
    smooth_idf=True,          # add 1 to document frequencies to prevent
                              # division by zero for unseen terms
    sublinear_tf=False)       # keep raw term frequency (no 1 + log(tf))
tfidf_df = tfidf_vectorizer.fit_transform(df['text'])
Here I pass a parameter max_features. The vectorizer will select the best features and return a scipy sparse matrix. The problem is that I don't know which features are getting selected, and how do I map those feature names back to the scipy matrix I get? Basically, for the n selected features from the m documents, I want an m x n matrix with the selected features as the column names instead of their integer ids. How do I accomplish this?
TF-IDF is often preferable to plain count vectors because it not only captures the frequency of words in the corpus but also reflects how informative each word is. Words that matter less for the analysis can then be removed, which reduces the input dimensionality and keeps model building simpler.
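As a quick illustration of that difference, here is a minimal sketch on a toy corpus (not the question's data):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["the cat sat on the mat", "the dog sat on the log"]

# Raw counts: every occurrence weighs the same, so frequent filler
# words like "the" dominate the vectors.
counts = CountVectorizer().fit_transform(corpus)
print(counts.toarray())

# TF-IDF: terms shared by every document ("the", "sat", "on") are
# down-weighted, while the discriminative terms ("cat"/"mat" vs
# "dog"/"log") receive the highest weights.
tfidf = TfidfVectorizer().fit_transform(corpus)
print(tfidf.toarray().round(2))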
Term frequency-inverse document frequency (TF-IDF) is a text vectorizer that transforms text into a usable numeric vector. It combines two concepts: term frequency (TF) and document frequency (DF). The term frequency is the number of occurrences of a specific term in a document; the document frequency is the number of documents in which that term appears.
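For concreteness, a minimal sketch of the arithmetic with made-up numbers (the smoothed idf formula below is the one scikit-learn uses when smooth_idf=True):

import numpy as np

# Illustrative numbers only: a corpus of n = 4 documents, where the term
# appears in df = 2 of them and tf = 3 times in the current document.
n, df, tf = 4, 2, 3

# scikit-learn's smoothed idf: ln((1 + n) / (1 + df)) + 1
idf = np.log((1 + n) / (1 + df)) + 1

# Raw tf-idf weight; TfidfVectorizer then L2-normalizes each document
# row by default (norm='l2').
print(tf * idf)  # ~4.53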
From the way the TF-IDF score is defined, removing stopwords shouldn't make a significant difference. The whole point of the IDF is to down-weight words with no discriminative value across the corpus: if you leave the stopwords in, the IDF component should largely suppress them.
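You can check this claim directly on a toy corpus: a word that occurs in every document receives the minimum possible idf (a sketch, not the question's data):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "the dog ran", "the bird flew"]
vec = TfidfVectorizer().fit(docs)

# "the" appears in all 3 documents, so its idf hits the floor of 1.0,
# while the rarer content words score higher.
for term, idx in sorted(vec.vocabulary_.items()):
    print(f"{term:>5}  idf = {vec.idf_[idx]:.3f}")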
When using the TfidfVectorizer with max_features=N (where N is not None), you might expect the algorithm to sort features by their tf-idf score and take the top N. Instead, it keeps the top N terms ordered by term frequency (raw counts) across the corpus.
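This is easy to verify on a toy corpus (hypothetical example; get_feature_names_out() requires scikit-learn >= 1.0):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["apple apple apple banana", "banana cherry", "banana date"]
vec = TfidfVectorizer(max_features=2).fit(docs)

# The vocabulary keeps the 2 terms with the highest raw counts across
# the corpus ("apple" x3, "banana" x3), not the top tf-idf scorers.
print(vec.get_feature_names_out())  # ['apple' 'banana']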
IDF shows how common or rare a given word is across all documents. TF-IDF does not convert raw data directly into useful features. First, it converts the raw strings in the dataset into vectors, so each document is represented as a vector of term weights. Then a technique that operates on vectors, such as cosine similarity, can be used to retrieve related features or documents.
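For example, a minimal sketch of retrieving related documents with cosine similarity over the tf-idf vectors (toy data):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the cat sat on the mat",
        "a cat sat on a mat",
        "stock markets fell sharply"]
tfidf = TfidfVectorizer().fit_transform(docs)

# Pairwise cosine similarities between document vectors: the two cat
# sentences score far closer to each other than to the third document.
print(cosine_similarity(tfidf).round(2))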
You can use tfidf_vectorizer.get_feature_names(). This will return the feature names (terms) selected from the raw documents. You can also use the tfidf_vectorizer.vocabulary_ attribute to get a dict that maps the feature names to their column indices; it is not sorted, while the array from get_feature_names() is ordered by column index.
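To get the m x n matrix with feature names as column labels, a minimal sketch (assuming the tfidf_vectorizer and tfidf_df names from the question; note that in scikit-learn >= 1.0 the method is get_feature_names_out(), and get_feature_names() was removed in 1.2):

import pandas as pd

feature_names = tfidf_vectorizer.get_feature_names()  # or get_feature_names_out()

# Densify only if m x n fits in memory; for a large corpus, keep the
# scipy sparse matrix and use feature_names purely as a column lookup.
dense = pd.DataFrame(tfidf_df.toarray(), columns=feature_names)
print(dense.head())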