 

How to classify new documents with tf-idf?

If I use the TfidfVectorizer from sklearn to generate feature vectors as:

features = TfidfVectorizer(min_df=0.2, ngram_range=(1,3)).fit_transform(myDocuments)

How would I then generate feature vectors to classify a new document, since you can't compute tf-idf statistics from a single document on its own?

Would it be a correct approach, to extract the feature names with:

feature_names = TfidfVectorizer.get_feature_names()

and then count the term frequency for the new document according to the feature_names?

But then I won't get the weights that carry the information about a word's importance.

Isbister asked Oct 18 '16 15:10

2 Answers

You need to save the instance of the TfidfVectorizer: it remembers the vocabulary and idf weights that were learned when it was fit. It may make things clearer if, rather than using fit_transform, you use fit and transform separately:

vec = TfidfVectorizer(min_df=0.2, ngram_range=(1,3))
vec.fit(myDocuments)
features = vec.transform(myDocuments)
new_features = vec.transform(myNewDocuments)
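A minimal runnable sketch of this pattern, using a toy corpus with default vectorizer settings (the document strings here are illustrative assumptions, not from the question):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

vec = TfidfVectorizer()            # default settings for the toy example
vec.fit(train_docs)                # learns vocabulary and idf weights
features = vec.transform(train_docs)

# A new document is projected onto the SAME vocabulary and idf weights;
# terms unseen during fit ("loudly") are simply ignored.
new_features = vec.transform(["the cat sat loudly"])

print(features.shape[1] == new_features.shape[1])  # True: same columns
```

Because transform reuses the fitted vocabulary, every new document gets a vector with exactly the same columns as the training matrix, so it can be fed straight into a classifier trained on `features`.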
maxymoo answered Oct 19 '22 18:10


I would rather use gensim with Latent Semantic Indexing as a wrapper over the original corpus: bow -> tfidf -> lsi

tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=300)
corpus_lsi = lsi[corpus_tfidf] # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi

Then if you need to continue the training:

new_tfidf = models.TfidfModel(new_corpus)
another_tfidf_corpus = new_tfidf[new_corpus]
lsi.add_documents(another_tfidf_corpus)  # now LSI has been trained on corpus_tfidf + another_tfidf_corpus
lsi_vec = lsi[tfidf_vec]  # convert some new document (as a tf-idf vector) into the LSI space

Where corpus is a bag-of-words representation of the documents (e.g. built with dictionary.doc2bow).
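Conceptually, the fold-in that `lsi[corpus_tfidf]` performs is a projection of each tf-idf vector onto the top singular vectors of the term-document matrix. A numpy-only sketch of that idea (the toy matrix and k=2 are assumptions for illustration, not gensim internals):

```python
import numpy as np

# Toy term-document tf-idf matrix: rows = 5 terms, columns = 4 documents
A = np.array([
    [0.9, 0.0, 0.1, 0.0],
    [0.8, 0.1, 0.0, 0.0],
    [0.0, 0.7, 0.9, 0.1],
    [0.1, 0.0, 0.2, 0.8],
    [0.0, 0.1, 0.0, 0.9],
])

# Truncated SVD: A ~ U_k @ diag(s_k) @ Vt_k
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, s_k = U[:, :k], s[:k]

# Fold-in: map a NEW document's tf-idf vector q into the k-dim LSI space
# without re-running the SVD.
q = np.array([0.5, 0.4, 0.0, 0.1, 0.0])
q_lsi = (q @ U_k) / s_k

print(q_lsi.shape)  # (2,)
```

This is why new documents can be folded in cheaply: projecting onto the fixed U_k is a matrix-vector product, and only occasionally does the model need a full update.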

As you can read in their tutorials:
LSI training is unique in that we can continue “training” at any point, simply by providing more training documents. This is done by incremental updates to the underlying model, in a process called online training. Because of this feature, the input document stream may even be infinite – just keep feeding LSI new documents as they arrive, while using the computed transformation model as read-only in the meanwhile!

If you prefer scikit-learn, gensim is also compatible with numpy arrays, so its output can be converted for use in a scikit-learn pipeline.

Mattia Fantoni answered Oct 19 '22 19:10