 

how to choose parameters in TfidfVectorizer in sklearn during unsupervised clustering

TfidfVectorizer provides an easy way to encode & transform texts into vectors.

My question is how to choose the proper values for parameters such as min_df, max_features, smooth_idf, sublinear_tf?

update:

Maybe I should have put more detail in the question:

What if I am doing unsupervised clustering on a bunch of texts? I don't have any labels for the texts, and I don't know how many clusters there might be (which is actually what I am trying to figure out).

asked May 19 '17 by user6396

People also ask

How is TF-IDF calculated in sklearn?

The formula used to compute the tf-idf for a term t of a document d in a document set is tf-idf(t, d) = tf(t, d) * idf(t), where idf is computed as idf(t) = log[n / df(t)] + 1 (if smooth_idf=False), n is the total number of documents in the document set, and df(t) is the document frequency of t, i.e. the number of documents that contain the term t. With smooth_idf=True (the default), this becomes idf(t) = log[(1 + n) / (1 + df(t))] + 1.
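This can be checked directly. The snippet below is a small sketch, on a made-up three-document corpus, comparing the idf_ values learned by TfidfVectorizer (with smooth_idf=False) against the formula above:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# toy corpus (hypothetical) with n = 3 documents
corpus = ["the cat sat", "the dog sat", "the cat ran"]

vec = TfidfVectorizer(smooth_idf=False)
vec.fit(corpus)

n = len(corpus)
for term, idx in sorted(vec.vocabulary_.items()):
    df = sum(term in doc.split() for doc in corpus)   # document frequency of the term
    manual_idf = np.log(n / df) + 1                   # idf(t) = log[n / df(t)] + 1
    print(term, round(vec.idf_[idx], 4), round(manual_idf, 4))  # the two values match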

How does TF-IDF Vectorizer work?

TF-IDF weights a word in proportion to the number of times it appears in a document, counterbalanced by the number of documents in which it is present. Hence, words like 'this', 'are', etc., that are commonly present in all the documents, are not given a very high rank.
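As a quick illustration (with made-up sentences), a word that appears in every document ends up with a lower weight than a word that is specific to one document:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["this movie is great", "this movie is terrible", "this movie is fine"]
vec = TfidfVectorizer()
X = vec.fit_transform(docs).toarray()

# weights for the first document: 'great' (unique to it) outweighs
# 'this', 'movie' and 'is', which occur in every document
for term, weight in zip(vec.get_feature_names_out(), X[0]):
    print(f"{term:10s} {weight:.3f}")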

What is Ngram_range in TfIdfVectorizer?

ngram_range: the lower and upper boundary of the range of n-values for the word or character n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams.
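For example (using an arbitrary sentence), switching ngram_range from (1, 1) to (1, 2) adds the bigrams to the vocabulary:

from sklearn.feature_extraction.text import TfidfVectorizer

text = ["the quick brown fox"]

print(TfidfVectorizer(ngram_range=(1, 1)).fit(text).get_feature_names_out())
# ['brown' 'fox' 'quick' 'the']

print(TfidfVectorizer(ngram_range=(1, 2)).fit(text).get_feature_names_out())
# ['brown' 'brown fox' 'fox' 'quick' 'quick brown' 'the' 'the quick']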

Does TfIdfVectorizer do Stemming?

Not by default. In the example below, we pass TfidfVectorizer our own function that performs custom tokenization and stemming, but we use scikit-learn's built-in stop word removal rather than NLTK's.
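A rough sketch of that idea, assuming NLTK's SnowballStemmer is available (the tokenizer and regex here are illustrative, not any tutorial's original code):

import re
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = SnowballStemmer("english")

def tokenize_and_stem(text):
    # custom tokenization (simple regex) followed by stemming
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [stemmer.stem(t) for t in tokens]

# stemming happens in the tokenizer; stop words are removed by sklearn afterwards
vec = TfidfVectorizer(tokenizer=tokenize_and_stem, stop_words="english")
vec.fit(["running runners run", "the runner ran quickly"])
print(vec.get_feature_names_out())   # stemmed forms such as 'run', 'runner', 'quick'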

What is TfidfVectorizer in sklearn?

TfidfVectorizer is scikit-learn's class for converting a collection of raw documents into a matrix of TF-IDF features. The tf (term frequency) part counts how many times a word appears within a single document; words that are common and appear with high frequency across all documents contribute less, because the idf part down-weights them.

What is the tf-idf vectorizer?

Each document is represented by a vector with one component per unique word in the vocabulary, which can be large. These tf-idf vectors can then be used, for example, to cluster the documents. Internally, TfidfVectorizer is equivalent to CountVectorizer followed by TfidfTransformer, which transforms a count matrix into a normalized tf or tf-idf representation.
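That equivalence can be verified on a toy corpus (the sentences below are made up):

import numpy as np
from sklearn.feature_extraction.text import (CountVectorizer, TfidfTransformer,
                                              TfidfVectorizer)

corpus = ["the cat sat on the mat", "the dog sat on the log"]

direct = TfidfVectorizer().fit_transform(corpus)           # one step

counts = CountVectorizer().fit_transform(corpus)           # raw term counts
two_step = TfidfTransformer().fit_transform(counts)        # normalized tf-idf

print(np.allclose(direct.toarray(), two_step.toarray()))   # True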

How does TF-IDF measure document frequency?

TF-IDF uses two statistical measures: term frequency and inverse document frequency. Term frequency is the number of times a given term t appears in a document divided by the total number of words in that document, and inverse document frequency measures how much information the word provides, i.e. how rare it is across the document set.
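A small hand-worked example of these two statistics, using the textbook formulas on a made-up two-document corpus (note that sklearn's variant adds smoothing and a +1, as described earlier):

import math

docs = [["the", "cat", "sat"], ["the", "dog", "barked", "at", "the", "cat"]]

def tf(term, doc):
    return doc.count(term) / len(doc)          # occurrences of term / words in doc

def idf(term):
    df = sum(term in doc for doc in docs)      # number of documents containing term
    return math.log(len(docs) / df)

print(tf("dog", docs[1]) * idf("dog"))   # rare word: 1/6 * log(2/1) ≈ 0.116
print(tf("cat", docs[1]) * idf("cat"))   # occurs in every document: idf = 0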

What is tf-idf?

If you are searching for tf-idf, you may already be familiar with feature extraction. TF-IDF stands for Term Frequency – Inverse Document Frequency. It is one of the most important techniques used in information retrieval to represent how important a specific word or phrase is to a given document.


1 Answer

If you are, for instance, using these vectors in a classification task, you can vary these parameters (and of course also the parameters of the classifier) and see which values give you the best performance.

You can do that easily in sklearn with the GridSearchCV and Pipeline objects:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),  # or your own stop-word list
    ('clf', OneVsRestClassifier(MultinomialNB(
        fit_prior=True, class_prior=None))),
])

# candidate values for the vectorizer and classifier parameters
parameters = {
    'tfidf__max_df': (0.25, 0.5, 0.75),
    'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'clf__estimator__alpha': (1e-2, 1e-3),
}

# train_x, train_y: your documents and their labels
grid_search_tune = GridSearchCV(pipeline, parameters, cv=2, n_jobs=2, verbose=3)
grid_search_tune.fit(train_x, train_y)

print("Best parameters set:")
print(grid_search_tune.best_estimator_.steps)
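The question is about unsupervised clustering, where train_y is not available. One hedged adaptation of the same idea (not part of the original answer) is to score each parameter combination with an internal clustering metric such as the silhouette score; the KMeans clusterer and the candidate values below are assumptions for illustration:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

def score_params(texts, max_df, ngram_range, n_clusters=5):
    # vectorize with the candidate parameters, cluster, and score without labels
    X = TfidfVectorizer(stop_words='english', max_df=max_df,
                        ngram_range=ngram_range).fit_transform(texts)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    return silhouette_score(X, labels)

# pick the combination with the highest silhouette score, e.g.:
# best = max(((m, n, score_params(train_x, m, n))
#             for m in (0.25, 0.5, 0.75)
#             for n in ((1, 1), (1, 2))), key=lambda t: t[-1])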
answered Oct 23 '22 by David Batista