I have a set of texts from Wikipedia.
Using tf-idf, I can define the weight of each word.
Below is the code:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the articles and fit a tf-idf model on the raw text
wiki = pd.read_csv('people_wiki.csv')
tfidf_vectorizer = TfidfVectorizer(max_features=1000000)
tfidf = tfidf_vectorizer.fit_transform(wiki['text'])
The goal is to see the weights, i.e. a tf-idf value for each word of each document.
The file 'people_wiki.csv' is here:
https://ufile.io/udg1y
Some background on the weighting: as its name implies, tf-idf scores a word by multiplying the word's term frequency (tf) by its inverse document frequency (idf), and terms with higher weights are considered more important. The term frequency of a word is the number of times it appears in a document, typically normalized by the total number of words in that document. For example, suppose the word cat has a (normalized) term frequency of 0.03 in some document. Now, assume we have 10 million documents and the word cat appears in one thousand of these. Then the inverse document frequency is calculated as log(10,000,000 / 1,000) = 4 (using a base-10 log), and the tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12.
scikit-learn computes tf-idf(t, d) = tf(t, d) * idf(t) for a term t of a document d in a document set, where tf(t, d) is the raw count of t in d and the idf is computed as idf(t) = log [ n / df(t) ] + 1 (if smooth_idf=False; the log is natural), where n is the total number of documents in the document set and df(t) is the document frequency of t. With the default smooth_idf=True, one is added to the numerator and the denominator, giving idf(t) = log [ (1 + n) / (1 + df(t)) ] + 1, and the resulting vectors are then normalized to unit length.
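To sanity-check the arithmetic in Python (the counts are the hypothetical ones from the example; note that the example uses a base-10 log, while scikit-learn uses the natural log):
import math

# Hypothetical counts from the example above
n_docs = 10_000_000   # total number of documents
df_cat = 1_000        # documents containing 'cat'
tf_cat = 0.03         # term frequency of 'cat' in one document

idf_cat = math.log10(n_docs / df_cat)   # log10(10,000) = 4.0
print(tf_cat * idf_cat)                 # 0.12

# scikit-learn's default variant (smooth_idf=True, natural log, l2-normalized afterwards)
print(math.log((1 + n_docs) / (1 + df_cat)) + 1)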
TfidfVectorizer has a vocabulary_ attribute which is very useful for what you want. This attribute is a dictionary with words as keys and the column index of each word as the value. For the example below I want the inverse of that dictionary, so I use a dictionary comprehension.
tfidf_vec = TfidfVectorizer()
transformed = tfidf_vec.fit_transform(raw_documents=['this is a quick example', 'just to show off'])
# Invert vocabulary_: map column index -> word
index_value = {i[1]: i[0] for i in tfidf_vec.vocabulary_.items()}
index_value will be used as a lookup table further on.
fit_transform returns a matrix in Compressed Sparse Row (CSR) format. The attributes which are useful for what you want to achieve are indices and data: indices holds the column indices that actually contain data, and data holds the values at those indices.
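To make that concrete, you can inspect a single row of the sparse matrix first (a small sketch; the exact order of the indices may vary):
first_row = transformed[0]                           # 1 x vocab_size sparse row for the first document
print(first_row.indices)                             # column indices of the non-zero terms
print(first_row.data)                                # their tf-idf weights (four entries of 0.5 here)
print([index_value[i] for i in first_row.indices])   # the corresponding words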
Loop over the returned transformed sparse matrix as follows:
fully_indexed = []
for row in transformed:
    fully_indexed.append({index_value[column]: value for (column, value) in zip(row.indices, row.data)})
This returns a list of dictionaries with the following contents:
[{'example': 0.5, 'is': 0.5, 'quick': 0.5, 'this': 0.5},
{'just': 0.5, 'off': 0.5, 'show': 0.5, 'to': 0.5}]
Please note that doing it this way only returns words that have a non-zero value for a specific document. Looking at the first document in my example, there is no 'just': 0.0 key-value pair in the dictionary. If you want to include those, you need to tweak the final dictionary comprehension a bit. Like so:
import numpy as np

fully_indexed = []
transformed = np.array(transformed.todense())
for row in transformed:
    fully_indexed.append({index_value[column]: value for (column, value) in enumerate(row)})
We create a dense version of the matrix as a NumPy array, loop over each row of the array, enumerate its contents, and then fill the list of dictionaries. Doing it this way results in output that also includes all words that were not present in a document:
[{'example': 0.5,'is': 0.5,'just': 0.0,'off': 0.0,'quick': 0.5,'show': 0.0,'this': 0.5,'to': 0.0},
{'example': 0.0,'is': 0.0,'just': 0.5,'off': 0.5,'quick': 0.0,'show': 0.5,'this': 0.0,'to': 0.5}]
You can then add the dictionaries to your dataframe.
df['tf_idf'] = fully_indexed
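Applied to the Wikipedia data from the question, the same sparse-row approach works; here is a minimal sketch assuming the wiki dataframe and tfidf matrix from the question. (Avoid the dense variant here: with max_features=1000000 the densified matrix would exhaust memory.)
index_value = {v: k for k, v in tfidf_vectorizer.vocabulary_.items()}

# One dict of non-zero word weights per article
wiki['tf_idf'] = [
    {index_value[c]: w for c, w in zip(row.indices, row.data)}
    for row in tfidf
]

# e.g. the ten highest-weighted words of the first article
print(sorted(wiki['tf_idf'][0].items(), key=lambda kv: kv[1], reverse=True)[:10])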