If I use the TfidfVectorizer
from sklearn
to generate feature vectors as:
features = TfidfVectorizer(min_df=0.2, ngram_range=(1,3)).fit_transform(myDocuments)
How would I then generate feature vectors to classify a new document? Since you cant calculate the tf-idf for a single document.
Would it be a correct approach, to extract the feature names with:
feature_names = TfidfVectorizer.get_feature_names()
and then count the term frequency for the new document according to the feature_names
?
But then I won't get the weights that have the information of a words importance.
You need to save the instance of the TfidfVectorizer, it will remember the term frequencies and vocabulary that was used to fit it. It may make things clearer sense if rather than using fit_transform
, you use fit
and transform
separately:
vec = TfidfVectorizer(min_df=0.2, ngram_range=(1,3))
vec.fit(myDocuments)
features = vec.transform(myDocuments)
new_features = fec.transform(myNewDocuments)
I would rather use gensim with a Latent Semantic Indexing as a wrapper over the original corpus: bow->tfidf->lsi
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=300)
corpus_lsi = lsi[corpus_tfidf] # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi
Then if you need to continue the training:
new_tfidf = models.TfidfModel(corpus)
new_corpus_tfidf = new_tfidf[corpus]
lsi.add_documents(another_tfidf_corpus) # now LSI has been trained on corpus_tfidf + another_tfidf_corpus
lsi_vec = model[tfidf_vec] # convert some new document into the LSI space
Where corpus is bag-of-words
As you can read in their tutorials:
LSI training is unique in that we can continue “training” at any point, simply by providing more training documents. This is done by incremental updates to the underlying model, in a process called online training. Because of this feature, the input document stream may even be infinite – just keep feeding LSI new documents as they arrive, while using the computed transformation model as read-only in the meanwhile!
If you like sci-kit, gensim is also compatible with numpy
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With