I'm trying to get words that are distinctive of certain documents using the TfidfVectorizer class in scikit-learn. It creates a tf-idf matrix with all the words and their scores across all the documents, but it seems to score common words highly, as well. This is some of the code I'm running:
```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(contents)
feature_names = vectorizer.get_feature_names()  # get_feature_names_out() in scikit-learn >= 1.0
dense = tfidf_matrix.todense()
denselist = dense.tolist()
df = pd.DataFrame(denselist, columns=feature_names, index=characters)
s = pd.Series(df.loc['Adam'])
s[s > 0].sort_values(ascending=False)[:10]
```
I expected this to return a list of distinctive words for the document 'Adam', but instead it returns a list of common words:
```
and     0.497077
to      0.387147
the     0.316648
of      0.298724
in      0.186404
with    0.144583
his     0.140998
```
I might not understand it perfectly, but as I understand it, tf-idf is supposed to find words that are distinctive of one document in a corpus: words that appear frequently in one document but not in other documents. Here, 'and' appears frequently in other documents too, so I don't know why it's returning a high value here.
The complete code I'm using to generate this is in this Jupyter notebook.
When I compute tf-idf scores semi-manually, using NLTK and computing scores for each word, I get the appropriate results. For the 'Adam' document:
```
fresh     0.000813
prime     0.000813
bone      0.000677
relate    0.000677
blame     0.000677
enough    0.000677
```
That looks about right, since these are words that appear in the 'Adam' document, but not as much in other documents in the corpus. The complete code used to generate this is in this Jupyter notebook.
Am I doing something wrong with the scikit-learn code? Is there another way to initialize this class so that it returns the right results? Of course, I can ignore stopwords by passing stop_words='english', but that doesn't really solve the problem, since common words of any sort shouldn't have high scores here.
Scikit-learn's CountVectorizer is used to transform a corpus of text into a vector of term/token counts. It also provides the ability to preprocess your text data prior to generating the vector representation, making it a highly flexible feature-representation module for text.
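For instance, here is a minimal sketch on a made-up two-document corpus (the corpus contents are illustrative, not from the question):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the cat sat on the mat", "the dog chased the cat"]
cv = CountVectorizer()
counts = cv.fit_transform(corpus)  # sparse document-term matrix of raw counts

print(cv.get_feature_names_out())  # get_feature_names() in scikit-learn < 1.0
# ['cat' 'chased' 'dog' 'mat' 'on' 'sat' 'the']
print(counts.toarray())
# [[1 0 0 1 1 1 2]
#  [1 1 1 0 0 0 2]]
```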
The formula used to compute the tf-idf for a term t of a document d in a document set is tf-idf(t, d) = tf(t, d) * idf(t), and the idf is computed as idf(t) = log[n / df(t)] + 1 (if smooth_idf=False), where n is the total number of documents in the document set and df(t) is the document frequency of t; the document frequency is the number of documents in the document set that contain the term t.
Tf-idf works by weighting a word in proportion to the number of times it appears in the document, counterbalanced by the number of documents in which it is present. Hence, words like 'this', 'are', etc., that are commonly present in all the documents are not given a very high rank.
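As a concrete illustration of that counterbalancing, here is a sketch with made-up numbers: a three-document corpus where "the" occurs in every document and "moon" occurs in only one.

```python
import math

n = 3  # total number of documents in the toy corpus

# idf(t) = log(n / df(t)) + 1   (smooth_idf=False)
idf_the  = math.log(n / 3) + 1   # df("the")  = 3 -> 1.0, the minimum weight
idf_moon = math.log(n / 1) + 1   # df("moon") = 1 -> ~2.10

# With scikit-learn's default smooth_idf=True the formula becomes
# idf(t) = log((1 + n) / (1 + df(t))) + 1, which still ranks them the
# same way: log(4/4) + 1 = 1.0 for "the" versus log(4/2) + 1 ~= 1.69.
print(idf_the, idf_moon)
```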
TfidfVectorizer transforms text to feature vectors that can be used as input to an estimator. vocabulary_ is a dictionary that maps each token (word) to a feature index in the matrix; each unique token gets a feature index.
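A minimal sketch of that mapping, reusing the toy corpus from above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the cat sat on the mat", "the dog chased the cat"]
tv = TfidfVectorizer()
tfidf = tv.fit_transform(corpus)

print(tv.vocabulary_)
# {'the': 6, 'cat': 0, 'sat': 5, 'on': 4, 'mat': 3, 'dog': 2, 'chased': 1}
# Column 6 of `tfidf` therefore holds the tf-idf scores for "the".
```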
From the scikit-learn documentation:
As tf–idf is very often used for text features, there is also another class called TfidfVectorizer that combines all the options of CountVectorizer and TfidfTransformer in a single model.
As you can see, TfidfVectorizer is a CountVectorizer followed by TfidfTransformer.
What you are probably looking for is TfidfTransformer, not TfidfVectorizer.
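A minimal sketch of the equivalence, on the same toy corpus: with default parameters, the one-step and two-step pipelines produce the same matrix.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

corpus = ["the cat sat on the mat", "the dog chased the cat"]

# One step: TfidfVectorizer goes straight from text to tf-idf scores.
a = TfidfVectorizer().fit_transform(corpus)

# Two steps: CountVectorizer produces raw counts, TfidfTransformer rescales them.
counts = CountVectorizer().fit_transform(corpus)
b = TfidfTransformer().fit_transform(counts)

print(np.allclose(a.toarray(), b.toarray()))  # True
```

The two-step form is useful when you want to keep the raw counts around or apply the transformer to counts produced elsewhere.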
I believe your issue lies in using different stopword lists. Scikit-learn and NLTK use different stopword lists by default. For scikit-learn, it is usually a good idea to pass a custom stop_words list to TfidfVectorizer, e.g.:
```python
my_stopword_list = ['and', 'to', 'the', 'of']
my_vectorizer = TfidfVectorizer(stop_words=my_stopword_list)
```
Doc page for the TfidfVectorizer class: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html