Which 10 words have the highest TF-IDF value in each document / in total?

Question

I am trying to get the words with the 10 highest TF-IDF scores for each document.

I have a column in my dataframe that contains the preprocessed text (without punctuation, stop words, etc.) from my various documents. One row means one document in this example.

[image: my dataframe]

It has over 500 rows and I am curious about the most important words in each row.

So I ran the following code:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(df['liststring'])
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names = vectorizer.get_feature_names_out()
dense = vectors.todense()
denselist = dense.tolist()
df2 = pd.DataFrame(denselist, columns=feature_names)

Which gives me a TF-IDF matrix:

[image: TF-IDF matrix]

My question is: how can I collect the 10 words that have the highest TF-IDF values? It would be nice to add a column to my original dataframe (df) that contains the top 10 words for each row, and also to know which words are the most important overall.

Sergey Bushmanov · Accepted Answer

A minimal reproducible example using the 20 newsgroups dataset:

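import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

X, y = fetch_20newsgroups(return_X_y=True)
tfidf = TfidfVectorizer()
# Dense array: one row per document, one column per vocabulary term
X_tfidf = tfidf.fit_transform(X).toarray()
vocab = tfidf.vocabulary_
# Map column index -> term
reverse_vocab = {v: k for k, v in vocab.items()}

# get_feature_names() was removed in scikit-learn 1.2
feature_names = tfidf.get_feature_names_out()
df_tfidf = pd.DataFrame(X_tfidf, columns=feature_names)

# Column indices of each row, sorted by ascending TF-IDF
idx = X_tfidf.argsort(axis=1)

# Last 10 indices per row = 10 highest scores (highest last)
tfidf_max10 = idx[:, -10:]

df_tfidf['top10'] = [[reverse_vocab.get(item) for item in row] for row in tfidf_max10]

df_tfidf['top10']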

0        [this, was, funky, rac3, bricklin, tellme, umd...
1        [1qvfo9innc3s, upgrade, experiences, carson, k...
2        [heard, anybody, 160, display, willis, powerbo...
3        [joe, green, csd, iastate, jgreen, amber, p900...
4        [tom, n3p, c5owcb, expected, std, launch, jona...
                               ...                        
11309    [millie, diagnosis, headache, factory, scan, j...
11310    [plus, jiggling, screen, bodin, blank, mac, wi...
11311    [weight, ended, vertical, socket, the, westes,...
11312    [central, steven, steve, collins, bolson, hcrl...
11313    [california, kjg, 2101240, willow, jh2sc281xpm...
Name: top10, Length: 11314, dtype: object

To get the 10 features with the highest TF-IDF score across all documents, use:

import numpy as np

# Highest score each term reaches in any document, then the 10 largest
global_top10_idx = X_tfidf.max(axis=0).argsort()[-10:]
np.asarray(feature_names)[global_top10_idx]

Please ask if something is not clear.

Tags:

python

pandas

scikit-learn

tf-idf

tfidfvectorizer

RozsOverFlow
