 

Obtain tf-idf weights of words with sklearn

I have a set of Wikipedia texts.
Using tf-idf, I can compute the weight of each word. Below is the code:

import pandas as pd                                             
from sklearn.feature_extraction.text import TfidfVectorizer

wiki = pd.read_csv('people_wiki.csv')

tfidf_vectorizer = TfidfVectorizer(max_features=1000000)
tfidf = tfidf_vectorizer.fit_transform(wiki['text'])

The goal is to see the weights as a tf-idf column added to the dataframe.

The file 'people_wiki.csv' is here:

https://ufile.io/udg1y

asked Jul 21 '17 by nunodsousa


People also ask

How are TF-IDF weights calculated?

Suppose the term cat has a term frequency of 0.03 in a given document (say, 3 occurrences out of 100 words). Now, assume we have 10 million documents and the word cat appears in one thousand of these. Then, the inverse document frequency (i.e., idf) is calculated as log(10,000,000 / 1,000) = 4. Thus, the tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12.
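Plugging the numbers from that example into a few lines of Python (the 0.03 term frequency and the document counts are the example's assumed values, not from any real corpus):

```python
import math

tf = 0.03                      # assumed term frequency of "cat" in one document
n_docs = 10_000_000            # total number of documents
df = 1_000                     # documents containing "cat"

idf = math.log10(n_docs / df)  # base-10 log, as in the example: 4.0
tfidf_weight = tf * idf        # 0.03 * 4 = 0.12
```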

How is TF-IDF calculated in Sklearn?

The formula that is used to compute the tf-idf for a term t of a document d in a document set is tf-idf(t, d) = tf(t, d) * idf(t), where the idf is computed as idf(t) = log[n / df(t)] + 1 (if smooth_idf=False), n is the total number of documents in the document set, and df(t) is the document frequency of t, i.e. the number of documents containing the term. With the default smooth_idf=True, the idf is instead computed as idf(t) = log[(1 + n) / (1 + df(t))] + 1.

How do I get my TF-IDF score?

As its name implies, TF-IDF vectorizes/scores a word by multiplying the word's Term Frequency (TF) with the Inverse Document Frequency (IDF). Term Frequency: TF of a term or word is the number of times the term appears in a document compared to the total number of words in the document.
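Under that definition, a minimal sketch of TF for a single word, using plain whitespace tokenization purely for illustration:

```python
doc = "the cat sat on the mat"
words = doc.split()

# TF of "the": occurrences of the term divided by total word count
tf_the = words.count("the") / len(words)   # 2 / 6
```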

What is TF-IDF term weighting?

How to Compute: tf-idf is a weighting scheme that assigns each term in a document a weight based on its term frequency (tf) and inverse document frequency (idf). The terms with higher weight scores are considered to be more important. Typically, the tf-idf weight is the product of two terms: the normalized term frequency (tf) and the inverse document frequency (idf).


1 Answer

TfidfVectorizer has a vocabulary_ attribute which is very useful for what you want. This attribute is a dictionary mapping each word to the index of the column that word occupies in the output matrix.

For the example below I want the inverse of that dictionary, so I build it with a dictionary comprehension.

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vec = TfidfVectorizer()
transformed = tfidf_vec.fit_transform(raw_documents=['this is a quick example', 'just to show off'])
index_value = {i[1]: i[0] for i in tfidf_vec.vocabulary_.items()}

index_value will be used as a lookup table further on.

fit_transform returns a matrix in Compressed Sparse Row (CSR) format. The attributes useful for what you want to achieve are indices and data: indices holds the column indices that actually contain data, and data holds the values stored at those indices.
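Those two attributes can be seen on a tiny hand-built sparse row (the values here are made up for illustration):

```python
from scipy.sparse import csr_matrix

row = csr_matrix([[0.0, 0.5, 0.0, 0.5]])

# indices: the columns that hold non-zero values
# data:    the values stored at those columns
print(row.indices)   # [1 3]
print(row.data)      # [0.5 0.5]
```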

Loop over the transformed sparse matrix as follows:

fully_indexed = []
for row in transformed:
    fully_indexed.append({index_value[column]:value for (column,value) in zip(row.indices,row.data)})

This produces a list of dictionaries with the following contents:

[{'example': 0.5, 'is': 0.5, 'quick': 0.5, 'this': 0.5},
 {'just': 0.5, 'off': 0.5, 'show': 0.5, 'to': 0.5}]

Please note that doing it this way only returns words that have a non-zero value for a specific document. Looking at the first document in my example, there is no ('just', 0.0) key-value pair in the dictionary. If you want to include those, you need to tweak the final dictionary comprehension a bit.

Like so

import numpy as np

fully_indexed = []
transformed = np.array(transformed.todense())
for row in transformed:
    fully_indexed.append({index_value[column]: value for (column, value) in enumerate(row)})

We create a dense version of the matrix as a numpy array, loop over each row, enumerate its contents, and fill the list of dictionaries. Doing it this way results in output that also includes all words that were not present in a document.

[{'example': 0.5,'is': 0.5,'just': 0.0,'off': 0.0,'quick': 0.5,'show': 0.0,'this': 0.5,'to': 0.0},
 {'example': 0.0,'is': 0.0,'just': 0.5,'off': 0.5,'quick': 0.0,'show': 0.5,'this': 0.0,'to': 0.5}]

You can then add the dictionaries to your dataframe.

df['tf_idf'] = fully_indexed
answered Nov 15 '22 by error