 

How to classify new documents with tf-idf?

If I use the TfidfVectorizer from sklearn to generate feature vectors as:

features = TfidfVectorizer(min_df=0.2, ngram_range=(1,3)).fit_transform(myDocuments)

How would I then generate feature vectors to classify a new document, since you can't compute tf-idf statistics from a single document on its own?

Would it be a correct approach, to extract the feature names with:

feature_names = TfidfVectorizer.get_feature_names()

and then count the term frequency for the new document according to the feature_names?

But then I won't get the weights that carry the information about a word's importance.

Isbister asked Oct 18 '16 15:10

2 Answers

You need to save the instance of the TfidfVectorizer: it remembers the vocabulary and idf weights that were learned when it was fit. It may make things clearer if, rather than using fit_transform, you use fit and transform separately:

vec = TfidfVectorizer(min_df=0.2, ngram_range=(1,3))
vec.fit(myDocuments)
features = vec.transform(myDocuments)
new_features = vec.transform(myNewDocuments)
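A minimal runnable sketch of this pattern, using a toy corpus with default vectorizer settings (the document strings here are illustrative assumptions, not from the question):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

vec = TfidfVectorizer()            # default settings for the toy example
vec.fit(train_docs)                # learns vocabulary and idf weights
features = vec.transform(train_docs)

# A new document is projected onto the SAME vocabulary and idf weights;
# terms unseen during fit ("loudly") are simply ignored.
new_features = vec.transform(["the cat sat loudly"])

print(features.shape[1] == new_features.shape[1])  # True: same columns
```

Because transform reuses the fitted vocabulary, every new document gets a vector with exactly the same columns as the training matrix, so it can be fed straight into a classifier trained on `features`.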
maxymoo answered Oct 19 '22 18:10


I would rather use gensim with Latent Semantic Indexing as a wrapper over the original corpus: bow -> tfidf -> lsi

tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=300)
corpus_lsi = lsi[corpus_tfidf] # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi

Then if you need to continue the training:

new_tfidf = models.TfidfModel(new_corpus)
another_tfidf_corpus = new_tfidf[new_corpus]
lsi.add_documents(another_tfidf_corpus)  # now LSI has been trained on corpus_tfidf + another_tfidf_corpus
lsi_vec = lsi[tfidf_vec]  # convert some new document (as a tf-idf vector) into the LSI space

Where corpus is a bag-of-words representation of the documents (e.g. built with dictionary.doc2bow).
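Conceptually, the fold-in that `lsi[corpus_tfidf]` performs is a projection of each tf-idf vector onto the top singular vectors of the term-document matrix. A numpy-only sketch of that idea (the toy matrix and k=2 are assumptions for illustration, not gensim internals):

```python
import numpy as np

# Toy term-document tf-idf matrix: rows = 5 terms, columns = 4 documents
A = np.array([
    [0.9, 0.0, 0.1, 0.0],
    [0.8, 0.1, 0.0, 0.0],
    [0.0, 0.7, 0.9, 0.1],
    [0.1, 0.0, 0.2, 0.8],
    [0.0, 0.1, 0.0, 0.9],
])

# Truncated SVD: A ~ U_k @ diag(s_k) @ Vt_k
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, s_k = U[:, :k], s[:k]

# Fold-in: map a NEW document's tf-idf vector q into the k-dim LSI space
# without re-running the SVD.
q = np.array([0.5, 0.4, 0.0, 0.1, 0.0])
q_lsi = (q @ U_k) / s_k

print(q_lsi.shape)  # (2,)
```

This is why new documents can be folded in cheaply: projecting onto the fixed U_k is a matrix-vector product, and only occasionally does the model need a full update.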

As you can read in their tutorials:
LSI training is unique in that we can continue “training” at any point, simply by providing more training documents. This is done by incremental updates to the underlying model, in a process called online training. Because of this feature, the input document stream may even be infinite – just keep feeding LSI new documents as they arrive, while using the computed transformation model as read-only in the meanwhile!

If you prefer scikit-learn, gensim is also compatible with numpy arrays, so its output can be converted for use in a scikit-learn pipeline.

Mattia Fantoni answered Oct 19 '22 19:10