Get selected feature names TFIDF Vectorizer

I'm using Python and want to compute the TF-IDF representation of a large corpus. I'm using the following code to convert the documents into their TF-IDF form.

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(
    min_df=1,  # keep terms that appear in at least 1 document
    max_features=4000,  # maximum number of features
    strip_accents='unicode',  # replace accented unicode chars
    # with their closest ASCII equivalents
    analyzer='word',  # features made of words
    token_pattern=r'\w{1,}',  # tokens are runs of 1+ word characters
    ngram_range=(1, 1),  # features made of single tokens (unigrams)
    use_idf=True,  # enable inverse-document-frequency reweighting
    smooth_idf=True,  # add one to document frequencies; prevents zero divisions
    sublinear_tf=False)  # use raw term frequency, not 1 + log(tf)

tfidf_df = tfidf_vectorizer.fit_transform(df['text'])

Here I pass the parameter max_features. The vectorizer selects the best features and returns a scipy sparse matrix. The problem is that I don't know which features are selected, or how to map their names back to the columns of the matrix I get. Basically, for the n features selected from m documents, I want an m x n matrix with the selected feature names as the column names instead of their integer ids. How do I accomplish this?

Clock Slave asked Mar 01 '17 06:03


2 Answers

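You can use tfidf_vectorizer.get_feature_names(). It returns the names of the selected features (the terms kept from the raw documents). In scikit-learn 1.0 and later, use tfidf_vectorizer.get_feature_names_out() instead; get_feature_names() was deprecated in 1.0 and removed in 1.2.

You can also use the tfidf_vectorizer.vocabulary_ attribute to get a dict that maps each feature name to its column index; the dict is not sorted. The names returned by get_feature_names() are ordered by index, so the i-th name labels column i of the matrix.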
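As a rough sketch of how this answers the question, the feature names can be passed as column labels when wrapping the matrix in a pandas DataFrame. The toy corpus, max_features=5 and the variable names below are assumptions made for illustration, not taken from the question.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for df['text'] from the question.
df = pd.DataFrame({"text": [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]})

vectorizer = TfidfVectorizer(max_features=5)
tfidf_matrix = vectorizer.fit_transform(df["text"])  # scipy sparse, shape (m, n)

# get_feature_names_out() exists in scikit-learn >= 1.0; older releases
# only provide get_feature_names().
if hasattr(vectorizer, "get_feature_names_out"):
    feature_names = list(vectorizer.get_feature_names_out())
else:
    feature_names = list(vectorizer.get_feature_names())

# Column i of the matrix corresponds to feature_names[i].
# toarray() densifies the matrix, so this only suits data that fits in memory.
tfidf_frame = pd.DataFrame(
    tfidf_matrix.toarray(), columns=feature_names, index=df.index
)
print(tfidf_frame.round(2))
```

For a large corpus, densifying an m x n matrix may not fit in memory; pd.DataFrame.sparse.from_spmatrix(tfidf_matrix, columns=feature_names) keeps the data sparse while still labelling the columns.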

Vivek Kumar answered Oct 12 '22 21:10


Use tfidf_vectorizer.vocabulary_; this gives a mapping from the features (terms) to their column indices.
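Since vocabulary_ maps term to index, going from a column index back to its term means inverting the dict. A minimal sketch, assuming a hypothetical three-document corpus and default vectorizer settings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["red apple", "green apple", "red grape"]  # hypothetical toy corpus
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)

# vocabulary_ maps term -> column index; the dict itself is not ordered by index.
term_to_index = vectorizer.vocabulary_

# Invert it to look up the term behind a given column index.
index_to_term = {int(idx): term for term, idx in term_to_index.items()}

# Walking the indices in order yields the column order of the matrix.
ordered_terms = [index_to_term[i] for i in range(matrix.shape[1])]
print(ordered_terms)
```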

parsethis answered Oct 12 '22 20:10