
Select top n TFIDF features for a given document

I am working with TFIDF sparse matrices for document classification and want to retain only the top n (say 50) terms for each document (ranked by TFIDF score). See EDIT below.

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

tfidfvectorizer = TfidfVectorizer(analyzer='word', stop_words='english',
                                  token_pattern=r'[A-Za-z][\w\-]*', max_df=0.25)
n = 50

df = pd.read_pickle('my_df.pickle')
df_t = tfidfvectorizer.fit_transform(df['text'])

df_t
Out[15]: 
<21175x201380 sparse matrix of type '<class 'numpy.float64'>'
    with 6055621 stored elements in Compressed Sparse Row format>

I have tried following the example in this post, although my aim is not to display the features but just to select the top n for each document before training. However, I get a MemoryError, as my data is too large to be converted to a dense matrix.

df_t_sorted = np.argsort(df_t.toarray()).flatten()[::1][n]
Traceback (most recent call last):

  File "<ipython-input-16-e0a74c393ca5>", line 1, in <module>
    df_t_sorted = np.argsort(df_t.toarray()).flatten()[::1][n]

  File "C:\Users\Me\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\sparse\compressed.py", line 943, in toarray
    out = self._process_toarray_args(order, out)

  File "C:\Users\Me\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\sparse\base.py", line 1130, in _process_toarray_args
    return np.zeros(self.shape, dtype=self.dtype, order=order)

MemoryError

Is there any way to do what I want without working with a dense representation (i.e. without the toarray() call) and without reducing the feature space much more than I already have (with max_df)?

Note: the max_features parameter is not what I want as it only considers "the top max_features ordered by term frequency across the corpus" (docs here) and what I want is a document-level ranking.

EDIT: I wonder if the best way to address this problem is to set the values of all features except the n-best to zero. I say this because the vocabulary has already been calculated, so feature indices must remain the same, as I will want to use them for other purposes (e.g. to visualise the actual words that correspond to the n-best features).

A colleague wrote some code to retrieve the indices of the n highest-ranked features:

n = 2
tops = np.zeros((df_t.shape[0], n), dtype=int)  # store the top indices in a new array
for ind in range(df_t.shape[0]):
    # for each row (i.e. document), argsort the negated values (argsort is
    # ascending) and keep the indices of the n largest
    tops[ind,] = np.argsort(-df_t[ind].toarray())[0, 0:n]

But from there, I would need to either:

  1. retrieve the list of remaining (i.e. lowest-ranked) indices and modify the values "in place", or
  2. loop through the original matrix (df_t) and set all values to 0 except for the n best indices in tops.

There is a post here explaining how to work with a csr_matrix, but I'm not sure how to put this into practice to get what I want.
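
For option 2, here is a minimal sketch that zeroes everything but the n largest values per row directly on the CSR arrays, so the matrix is never densified and the column (feature) indices of the surviving entries stay unchanged, as required by the EDIT above. keep_top_n_per_row is an illustrative helper name, not a scipy function:

import numpy as np
from scipy.sparse import csr_matrix

def keep_top_n_per_row(mat, n):
    mat = csr_matrix(mat, copy=True)  # work on a copy; the original is untouched
    for i in range(mat.shape[0]):
        start, end = mat.indptr[i], mat.indptr[i + 1]
        row = mat.data[start:end]  # the stored values of row i (a view)
        if row.size > n:
            # argpartition places the n largest values last; zero the rest
            row[np.argpartition(row, -n)[:-n]] = 0
    mat.eliminate_zeros()  # drop the explicit zeros from the sparse structure
    return mat

df_t_top = keep_top_n_per_row(df_t, 50)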

asked Oct 24 '18 by ongenz


2 Answers

from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer(tokenizer=word_tokenize, ngram_range=(1, 2), binary=True, max_features=50)
TFIDF = vect.fit_transform(df['processed_cv_data'])

The max_features parameter passed to the TfidfVectorizer picks out the top 50 features ordered by term frequency across the corpus, not by tf-idf score. You can view the selected features with:

print(vect.get_feature_names())
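
If you also want to see each feature's weight rather than just its name, here is a small sketch building on the TFIDF matrix and vect above (summing to a corpus-level score is an assumption, not part of the original answer):

import numpy as np

scores = np.asarray(TFIDF.sum(axis=0)).ravel()  # summed tf-idf weight per feature
ranked = sorted(zip(vect.get_feature_names(), scores), key=lambda pair: -pair[1])
print(ranked[:10])  # the ten heaviest features across the corpus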
answered Sep 19 '22 by Harsha Reddy


As you mention, the max_features parameter of the TfidfVectorizer is one way of selecting features.

If you are looking for an alternative that takes the relationship to the target variable into account, you can use sklearn's SelectKBest. Setting k=50 filters your data down to the 50 best features. The metric used for selection can be specified via the score_func parameter.

Example:

from sklearn.feature_selection import SelectKBest
from sklearn.feature_extraction.text import TfidfVectorizer

tfidfvectorizer = TfidfVectorizer(analyzer='word', stop_words='english',
                                  token_pattern=r'[A-Za-z][\w\-]*', max_df=0.25)

df_t = tfidfvectorizer.fit_transform(df['text'])
df_t_reduced = SelectKBest(k=50).fit_transform(df_t, df['target'])
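
The default score_func is f_classif; for non-negative text features such as tf-idf, chi2 is a common alternative. A sketch, assuming df['target'] holds the class labels:

from sklearn.feature_selection import SelectKBest, chi2

selector = SelectKBest(chi2, k=50)
df_t_reduced = selector.fit_transform(df_t, df['target'])
# map the kept columns back to terms in the vectorizer's vocabulary
kept = selector.get_support(indices=True)
print([tfidfvectorizer.get_feature_names()[i] for i in kept])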

You can also chain it in a pipeline:

from sklearn.pipeline import Pipeline

pipeline = Pipeline([("vectorizer", TfidfVectorizer()),
                     ("feature_reduction", SelectKBest(k=50)),
                     ("classifier", classifier)])
answered Sep 20 '22 by Glyph