Which 10 words have the highest TF-IDF value in each document / in total?

I am trying to get the words with the 10 highest TF-IDF scores for each document.

I have a column in my dataframe that contains the preprocessed text (without punctuation, stop words, etc.) from my various documents. One row means one document in this example.

[Image: my dataframe]

It has over 500 rows and I am curious about the most important words in each row.

So I ran the following code:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(df['liststring'])
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names = vectorizer.get_feature_names_out()
dense = vectors.todense()
denselist = dense.tolist()
df2 = pd.DataFrame(denselist, columns=feature_names)

Which gives me a TF-IDF matrix:

[Image: TF-IDF matrix]

My question is: how can I collect the 10 words that have the highest TF-IDF values? It would be nice to add a column to my original dataframe (df) that contains the top 10 words for each row, and also to know which words are the most important overall.

RozsOverFlow asked Sep 05 '25 03:09

1 Answer

A minimal reproducible example using the 20 newsgroups dataset:

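import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

X, y = fetch_20newsgroups(return_X_y=True)
tfidf = TfidfVectorizer()
# Note: the dense array is (documents x vocabulary) and needs a lot of memory
X_tfidf = tfidf.fit_transform(X).toarray()
vocab = tfidf.vocabulary_
# Map column index -> term
reverse_vocab = {v: k for k, v in vocab.items()}

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names = tfidf.get_feature_names_out()
df_tfidf = pd.DataFrame(X_tfidf, columns=feature_names)

# Column indices of each row, sorted by ascending TF-IDF score
idx = X_tfidf.argsort(axis=1)

# The last 10 indices are the 10 highest-scoring terms (highest score last)
tfidf_max10 = idx[:, -10:]

df_tfidf['top10'] = [[reverse_vocab.get(item) for item in row] for row in tfidf_max10]

df_tfidf['top10']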

0        [this, was, funky, rac3, bricklin, tellme, umd...
1        [1qvfo9innc3s, upgrade, experiences, carson, k...
2        [heard, anybody, 160, display, willis, powerbo...
3        [joe, green, csd, iastate, jgreen, amber, p900...
4        [tom, n3p, c5owcb, expected, std, launch, jona...
                               ...                        
11309    [millie, diagnosis, headache, factory, scan, j...
11310    [plus, jiggling, screen, bodin, blank, mac, wi...
11311    [weight, ended, vertical, socket, the, westes,...
11312    [central, steven, steve, collins, bolson, hcrl...
11313    [california, kjg, 2101240, willow, jh2sc281xpm...
Name: top10, Length: 11314, dtype: object

To get the 10 features with the highest TF-IDF scores across the whole corpus (ranked by the maximum score each term reaches in any document), use:

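import numpy as np

# Highest score each term reaches in any document, then the 10 largest (ascending)
global_top10_idx = X_tfidf.max(axis=0).argsort()[-10:]
np.asarray(feature_names)[global_top10_idx]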

Please ask if something is not clear.

Sergey Bushmanov answered Sep 07 '25 16:09
