I have a set of texts from Wikipedia.
Using tf-idf, I can define the weight of each word.
Below is the code:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the articles and fit a tf-idf model on the raw text
wiki = pd.read_csv('people_wiki.csv')
tfidf_vectorizer = TfidfVectorizer(max_features=1000000)
tfidf = tfidf_vectorizer.fit_transform(wiki['text'])
The goal is to see the weights, i.e. a tf-idf value for each word of each document.
The file 'people_wiki.csv' is here:
https://ufile.io/udg1y
Some background on the weighting: as its name implies, tf-idf scores a word by multiplying the word's term frequency (tf) by its inverse document frequency (idf), and terms with higher weights are considered more important. The term frequency of a word is the number of times it appears in a document, typically normalized by the total number of words in that document. For example, suppose the word cat has a (normalized) term frequency of 0.03 in some document. Now, assume we have 10 million documents and the word cat appears in one thousand of these. Then the inverse document frequency is calculated as log(10,000,000 / 1,000) = 4 (using a base-10 log), and the tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12.
scikit-learn computes tf-idf(t, d) = tf(t, d) * idf(t) for a term t of a document d in a document set, where tf(t, d) is the raw count of t in d and the idf is computed as idf(t) = log [ n / df(t) ] + 1 (if smooth_idf=False; the log is natural), where n is the total number of documents in the document set and df(t) is the document frequency of t. With the default smooth_idf=True, one is added to the numerator and the denominator, giving idf(t) = log [ (1 + n) / (1 + df(t)) ] + 1, and the resulting vectors are then normalized to unit length.
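To sanity-check the arithmetic in Python (the counts are the hypothetical ones from the example; note that the example uses a base-10 log, while scikit-learn uses the natural log):
import math

# Hypothetical counts from the example above
n_docs = 10_000_000   # total number of documents
df_cat = 1_000        # documents containing 'cat'
tf_cat = 0.03         # term frequency of 'cat' in one document

idf_cat = math.log10(n_docs / df_cat)   # log10(10,000) = 4.0
print(tf_cat * idf_cat)                 # 0.12

# scikit-learn's default variant (smooth_idf=True, natural log, l2-normalized afterwards)
print(math.log((1 + n_docs) / (1 + df_cat)) + 1)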
TfidfVectorizer has a vocabulary_ attribute which is very useful for what you want. This attribute is a dictionary with words as keys and the column index of each word as the value. For the example below I want the inverse of that dictionary, so I use a dictionary comprehension.
tfidf_vec = TfidfVectorizer()
transformed = tfidf_vec.fit_transform(raw_documents=['this is a quick example', 'just to show off'])
# Invert vocabulary_: map column index -> word
index_value = {i[1]: i[0] for i in tfidf_vec.vocabulary_.items()}
index_value will be used as a lookup table further on.
fit_transform returns a matrix in Compressed Sparse Row (CSR) format. The attributes which are useful for what you want to achieve are indices and data: indices holds the column indices that actually contain data, and data holds the values at those indices.
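To make that concrete, you can inspect a single row of the sparse matrix first (a small sketch; the exact order of the indices may vary):
first_row = transformed[0]                           # 1 x vocab_size sparse row for the first document
print(first_row.indices)                             # column indices of the non-zero terms
print(first_row.data)                                # their tf-idf weights (four entries of 0.5 here)
print([index_value[i] for i in first_row.indices])   # the corresponding words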
Loop over the returned transformed sparse matrix as follows:
fully_indexed = []
for row in transformed:
    fully_indexed.append({index_value[column]: value for (column, value) in zip(row.indices, row.data)})
This returns a list of dictionaries with the following contents:
[{'example': 0.5, 'is': 0.5, 'quick': 0.5, 'this': 0.5},
{'just': 0.5, 'off': 0.5, 'show': 0.5, 'to': 0.5}]
Please note that doing it this way only returns words that have a non-zero value for a specific document. Looking at the first document in my example, there is no 'just': 0.0 key-value pair in the dictionary. If you want to include those, you need to tweak the final dictionary comprehension a bit. Like so:
import numpy as np

fully_indexed = []
transformed = np.array(transformed.todense())
for row in transformed:
    fully_indexed.append({index_value[column]: value for (column, value) in enumerate(row)})
We create a dense version of the matrix as a NumPy array, loop over each row of the array, enumerate its contents, and then fill the list of dictionaries. Doing it this way results in output that also includes all words that were not present in a document:
[{'example': 0.5,'is': 0.5,'just': 0.0,'off': 0.0,'quick': 0.5,'show': 0.0,'this': 0.5,'to': 0.0},
{'example': 0.0,'is': 0.0,'just': 0.5,'off': 0.5,'quick': 0.0,'show': 0.5,'this': 0.0,'to': 0.5}]
You can then add the dictionaries to your dataframe.
df['tf_idf'] = fully_indexed
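Applied to the Wikipedia data from the question, the same sparse-row approach works; here is a minimal sketch assuming the wiki dataframe and tfidf matrix from the question. (Avoid the dense variant here: with max_features=1000000 the densified matrix would exhaust memory.)
index_value = {v: k for k, v in tfidf_vectorizer.vocabulary_.items()}

# One dict of non-zero word weights per article
wiki['tf_idf'] = [
    {index_value[c]: w for c, w in zip(row.indices, row.data)}
    for row in tfidf
]

# e.g. the ten highest-weighted words of the first article
print(sorted(wiki['tf_idf'][0].items(), key=lambda kv: kv[1], reverse=True)[:10])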