 

how to choose parameters in TfidfVectorizer in sklearn during unsupervised clustering

TfidfVectorizer provides an easy way to encode & transform texts into vectors.

My question is how to choose the proper values for parameters such as min_df, max_features, smooth_idf, sublinear_tf?

update:

Maybe I should have put more detail in the question:

What if I am doing unsupervised clustering on a bunch of texts? I don't have any labels for the texts, and I don't know how many clusters there might be (which is actually what I am trying to figure out).

asked May 19 '17 by user6396

People also ask

How is TF-IDF calculated in sklearn?

The formula used to compute the tf-idf for a term t of a document d in a document set is tf-idf(t, d) = tf(t, d) * idf(t), where idf is computed as idf(t) = log[n / df(t)] + 1 (if smooth_idf=False), n is the total number of documents in the document set, and df(t) is the document frequency of t, i.e. the number of documents that contain the term t. With smooth_idf=True (the default), this becomes idf(t) = log[(1 + n) / (1 + df(t))] + 1.
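This can be checked directly. The snippet below is a small sketch, on a made-up three-document corpus, comparing the idf_ values learned by TfidfVectorizer (with smooth_idf=False) against the formula above:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# toy corpus (hypothetical) with n = 3 documents
corpus = ["the cat sat", "the dog sat", "the cat ran"]

vec = TfidfVectorizer(smooth_idf=False)
vec.fit(corpus)

n = len(corpus)
for term, idx in sorted(vec.vocabulary_.items()):
    df = sum(term in doc.split() for doc in corpus)   # document frequency of the term
    manual_idf = np.log(n / df) + 1                   # idf(t) = log[n / df(t)] + 1
    print(term, round(vec.idf_[idx], 4), round(manual_idf, 4))  # the two values match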

How does TF-IDF Vectorizer work?

TF-IDF weights a word in proportion to the number of times it appears in a document, counterbalanced by the number of documents in which it is present. Hence, words like 'this', 'are', etc., that are commonly present in all the documents, are not given a very high rank.
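As a quick illustration (with made-up sentences), a word that appears in every document ends up with a lower weight than a word that is specific to one document:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["this movie is great", "this movie is terrible", "this movie is fine"]
vec = TfidfVectorizer()
X = vec.fit_transform(docs).toarray()

# weights for the first document: 'great' (unique to it) outweighs
# 'this', 'movie' and 'is', which occur in every document
for term, weight in zip(vec.get_feature_names_out(), X[0]):
    print(f"{term:10s} {weight:.3f}")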

What is Ngram_range in TfIdfVectorizer?

ngram_range: the lower and upper boundary of the range of n-values for the word or character n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams.
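For example (using an arbitrary sentence), switching ngram_range from (1, 1) to (1, 2) adds the bigrams to the vocabulary:

from sklearn.feature_extraction.text import TfidfVectorizer

text = ["the quick brown fox"]

print(TfidfVectorizer(ngram_range=(1, 1)).fit(text).get_feature_names_out())
# ['brown' 'fox' 'quick' 'the']

print(TfidfVectorizer(ngram_range=(1, 2)).fit(text).get_feature_names_out())
# ['brown' 'brown fox' 'fox' 'quick' 'quick brown' 'the' 'the quick']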

Does TfIdfVectorizer do Stemming?

Not by default. In the example below, we pass TfidfVectorizer our own function that performs custom tokenization and stemming, but we use scikit-learn's built-in stop word removal rather than NLTK's.
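A rough sketch of that idea, assuming NLTK's SnowballStemmer is available (the tokenizer and regex here are illustrative, not any tutorial's original code):

import re
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = SnowballStemmer("english")

def tokenize_and_stem(text):
    # custom tokenization (simple regex) followed by stemming
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [stemmer.stem(t) for t in tokens]

# stemming happens in the tokenizer; stop words are removed by sklearn afterwards
vec = TfidfVectorizer(tokenizer=tokenize_and_stem, stop_words="english")
vec.fit(["running runners run", "the runner ran quickly"])
print(vec.get_feature_names_out())   # stemmed forms such as 'run', 'runner', 'quick'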

What is TfidfVectorizer in sklearn?

TfidfVectorizer is scikit-learn's class for converting a collection of raw documents into a matrix of TF-IDF features. The tf (term frequency) part counts how many times a word appears within a single document; words that are common and appear with high frequency across all documents contribute less, because the idf part down-weights them.

What is the tf-idf vectorizer?

Each document is represented by a vector with one component per unique word in the vocabulary, which can be large. These tf-idf vectors can then be used, for example, to cluster the documents. Internally, TfidfVectorizer is equivalent to CountVectorizer followed by TfidfTransformer, which transforms a count matrix into a normalized tf or tf-idf representation.
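That equivalence can be verified on a toy corpus (the sentences below are made up):

import numpy as np
from sklearn.feature_extraction.text import (CountVectorizer, TfidfTransformer,
                                              TfidfVectorizer)

corpus = ["the cat sat on the mat", "the dog sat on the log"]

direct = TfidfVectorizer().fit_transform(corpus)           # one step

counts = CountVectorizer().fit_transform(corpus)           # raw term counts
two_step = TfidfTransformer().fit_transform(counts)        # normalized tf-idf

print(np.allclose(direct.toarray(), two_step.toarray()))   # True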

How does TF-IDF measure document frequency?

TF-IDF uses two statistical measures: term frequency and inverse document frequency. Term frequency is the number of times a given term t appears in a document divided by the total number of words in that document, and inverse document frequency measures how much information the word provides, i.e. how rare it is across the document set.
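A small hand-worked example of these two statistics, using the textbook formulas on a made-up two-document corpus (note that sklearn's variant adds smoothing and a +1, as described earlier):

import math

docs = [["the", "cat", "sat"], ["the", "dog", "barked", "at", "the", "cat"]]

def tf(term, doc):
    return doc.count(term) / len(doc)          # occurrences of term / words in doc

def idf(term):
    df = sum(term in doc for doc in docs)      # number of documents containing term
    return math.log(len(docs) / df)

print(tf("dog", docs[1]) * idf("dog"))   # rare word: 1/6 * log(2/1) ≈ 0.116
print(tf("cat", docs[1]) * idf("cat"))   # occurs in every document: idf = 0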

What is tf-idf?

If you are searching for tf-idf, you may already be familiar with feature extraction. TF-IDF stands for Term Frequency – Inverse Document Frequency. It is one of the most important techniques used in information retrieval to represent how important a specific word or phrase is to a given document.


1 Answer

If you are, for instance, using these vectors in a classification task, you can vary these parameters (and of course also the parameters of the classifier) and see which values give you the best performance.

You can do that easily in sklearn with the GridSearchCV and Pipeline objects:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),  # or your own stop-word list
    ('clf', OneVsRestClassifier(MultinomialNB(
        fit_prior=True, class_prior=None))),
])

# candidate values for the vectorizer and classifier parameters
parameters = {
    'tfidf__max_df': (0.25, 0.5, 0.75),
    'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'clf__estimator__alpha': (1e-2, 1e-3),
}

# train_x, train_y: your documents and their labels
grid_search_tune = GridSearchCV(pipeline, parameters, cv=2, n_jobs=2, verbose=3)
grid_search_tune.fit(train_x, train_y)

print("Best parameters set:")
print(grid_search_tune.best_estimator_.steps)
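The question is about unsupervised clustering, where train_y is not available. One hedged adaptation of the same idea (not part of the original answer) is to score each parameter combination with an internal clustering metric such as the silhouette score; the KMeans clusterer and the candidate values below are assumptions for illustration:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

def score_params(texts, max_df, ngram_range, n_clusters=5):
    # vectorize with the candidate parameters, cluster, and score without labels
    X = TfidfVectorizer(stop_words='english', max_df=max_df,
                        ngram_range=ngram_range).fit_transform(texts)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    return silhouette_score(X, labels)

# pick the combination with the highest silhouette score, e.g.:
# best = max(((m, n, score_params(train_x, m, n))
#             for m in (0.25, 0.5, 0.75)
#             for n in ((1, 1), (1, 2))), key=lambda t: t[-1])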
answered Oct 23 '22 by David Batista