
Select top n TFIDF features for a given document

I am working with TFIDF sparse matrices for document classification and want to retain only the top n (say 50) terms for each document (ranked by TFIDF score). See EDIT below.

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

tfidfvectorizer = TfidfVectorizer(analyzer='word', stop_words='english',
                                  token_pattern=r'[A-Za-z][\w\-]*', max_df=0.25)
n = 50

df = pd.read_pickle('my_df.pickle')
df_t = tfidfvectorizer.fit_transform(df['text'])

df_t
Out[15]: 
<21175x201380 sparse matrix of type '<class 'numpy.float64'>'
    with 6055621 stored elements in Compressed Sparse Row format>

I have tried following the example in this post, although my aim is not to display the features but just to select the top n for each document before training. However, I get a MemoryError, as my data is too large to be converted to a dense matrix.

df_t_sorted = np.argsort(df_t.toarray()).flatten()[::1][n]
Traceback (most recent call last):

  File "<ipython-input-16-e0a74c393ca5>", line 1, in <module>
    df_t_sorted = np.argsort(df_t.toarray()).flatten()[::1][n]

  File "C:\Users\Me\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\sparse\compressed.py", line 943, in toarray
    out = self._process_toarray_args(order, out)

  File "C:\Users\Me\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\sparse\base.py", line 1130, in _process_toarray_args
    return np.zeros(self.shape, dtype=self.dtype, order=order)

MemoryError

Is there any way to do what I want without working with a dense representation (i.e. without the toarray() call) and without reducing the feature space much more than I already have (with max_df)?

Note: the max_features parameter is not what I want as it only considers "the top max_features ordered by term frequency across the corpus" (docs here) and what I want is a document-level ranking.

EDIT: I wonder if the best way to address this problem is to set the values of all features except the n-best to zero. I say this because the vocabulary has already been calculated, so feature indices must remain the same, as I will want to use them for other purposes (e.g. to visualise the actual words that correspond to the n-best features).

A colleague wrote some code to retrieve the indices of the n highest-ranked features:

n = 2
tops = np.zeros((df_t.shape[0], n), dtype=int)  # store the top indices in a new array
for ind in range(df_t.shape[0]):
    # for each row (i.e. document), argsort the negated values (argsort is
    # ascending) and keep the indices of the n largest
    tops[ind,] = np.argsort(-df_t[ind].toarray())[0, 0:n]

But from there, I would need to either:

  1. retrieve the list of remaining (i.e. lowest-ranked) indices and modify the values "in place", or
  2. loop through the original matrix (df_t) and set all values to 0 except for the n best indices in tops.

There is a post here explaining how to work with a csr_matrix, but I'm not sure how to put this into practice to get what I want.
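
For option 2, here is a minimal sketch that zeroes everything but the n largest values per row directly on the CSR arrays, so the matrix is never densified and the column (feature) indices of the surviving entries stay unchanged, as required by the EDIT above. keep_top_n_per_row is an illustrative helper name, not a scipy function:

import numpy as np
from scipy.sparse import csr_matrix

def keep_top_n_per_row(mat, n):
    mat = csr_matrix(mat, copy=True)  # work on a copy; the original is untouched
    for i in range(mat.shape[0]):
        start, end = mat.indptr[i], mat.indptr[i + 1]
        row = mat.data[start:end]  # the stored values of row i (a view)
        if row.size > n:
            # argpartition places the n largest values last; zero the rest
            row[np.argpartition(row, -n)[:-n]] = 0
    mat.eliminate_zeros()  # drop the explicit zeros from the sparse structure
    return mat

df_t_top = keep_top_n_per_row(df_t, 50)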

asked Oct 24 '18 by ongenz


2 Answers

from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer(tokenizer=word_tokenize, ngram_range=(1, 2), binary=True, max_features=50)
TFIDF = vect.fit_transform(df['processed_cv_data'])

The max_features parameter passed to the TfidfVectorizer picks out the top 50 features ordered by term frequency across the corpus, not by tf-idf score. You can view the selected features with:

print(vect.get_feature_names())
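
If you also want to see each feature's weight rather than just its name, here is a small sketch building on the TFIDF matrix and vect above (summing to a corpus-level score is an assumption, not part of the original answer):

import numpy as np

scores = np.asarray(TFIDF.sum(axis=0)).ravel()  # summed tf-idf weight per feature
ranked = sorted(zip(vect.get_feature_names(), scores), key=lambda pair: -pair[1])
print(ranked[:10])  # the ten heaviest features across the corpus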
answered Sep 19 '22 by Harsha Reddy


As you mention, the max_features parameter of the TfidfVectorizer is one way of selecting features.

If you are looking for an alternative that takes the relationship to the target variable into account, you can use sklearn's SelectKBest. Setting k=50 filters your data down to the 50 best features. The metric used for selection can be specified via the score_func parameter.

Example:

from sklearn.feature_selection import SelectKBest
from sklearn.feature_extraction.text import TfidfVectorizer

tfidfvectorizer = TfidfVectorizer(analyzer='word', stop_words='english',
                                  token_pattern=r'[A-Za-z][\w\-]*', max_df=0.25)

df_t = tfidfvectorizer.fit_transform(df['text'])
df_t_reduced = SelectKBest(k=50).fit_transform(df_t, df['target'])
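
The default score_func is f_classif; for non-negative text features such as tf-idf, chi2 is a common alternative. A sketch, assuming df['target'] holds the class labels:

from sklearn.feature_selection import SelectKBest, chi2

selector = SelectKBest(chi2, k=50)
df_t_reduced = selector.fit_transform(df_t, df['target'])
# map the kept columns back to terms in the vectorizer's vocabulary
kept = selector.get_support(indices=True)
print([tfidfvectorizer.get_feature_names()[i] for i in kept])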

You can also chain it in a pipeline:

from sklearn.pipeline import Pipeline

pipeline = Pipeline([("vectorizer", TfidfVectorizer()),
                     ("feature_reduction", SelectKBest(k=50)),
                     ("classifier", classifier)])
answered Sep 20 '22 by Glyph