
Updating the feature names into scikit TFIdfVectorizer

I am trying out this code

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

train_data = ["football is the sport","gravity is the movie", "education is imporatant"]
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                                                 stop_words='english')

print "Applying first train data"
X_train = vectorizer.fit_transform(train_data)
print vectorizer.get_feature_names()

print "\n\nApplying second train data"
train_data = ["cricket", "Transformers is a film","AIMS is a college"]
X_train = vectorizer.transform(train_data)
print vectorizer.get_feature_names()

print "\n\nApplying fit transform onto second train data"
X_train = vectorizer.fit_transform(train_data)
print vectorizer.get_feature_names()

The output for this one is

Applying first train data
[u'education', u'football', u'gravity', u'imporatant', u'movie', u'sport']


Applying second train data
[u'education', u'football', u'gravity', u'imporatant', u'movie', u'sport']


 Applying fit transform onto second train data
[u'aims', u'college', u'cricket', u'film', u'transformers']

I gave the first set of data to the vectorizer using fit_transform, so it gave me feature names like [u'education', u'football', u'gravity', u'imporatant', u'movie', u'sport']. After that I applied a second train set to the same vectorizer, but it gave me the same feature names because I didn't use fit or fit_transform again.

What I want to know is how to update the features of a vectorizer without overwriting the previous ones. If I use fit_transform again, the previous features get overwritten. So I want to update the vectorizer's feature list and end up with something like [u'education', u'football', u'gravity', u'imporatant', u'movie', u'sport', u'aims', u'college', u'cricket', u'film', u'transformers']. How can I get that?

asked Aug 06 '14 by Gunjan

People also ask

How do you get a feature name from TfidfVectorizer?

You can use tfidf_vectorizer.get_feature_names(). This prints the feature names (the terms) selected from the raw documents.

What is TfidfVectorizer in Sklearn?

tf-idf is used for tasks such as document classification and ranking in search engines. tf is the term frequency (the count of a word within a document), and idf is the inverse document frequency (a measure of how informative the word is across the document collection).
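
As a rough sketch of how the two parts combine in sklearn (assuming the defaults smooth_idf=True and norm='l2', and using the idf_ attribute exposed after fitting), the smoothed idf can be reproduced by hand:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["football is the sport", "gravity is the movie", "education is important"]
vec = TfidfVectorizer(stop_words='english')  # defaults: smooth_idf=True, norm='l2'
X = vec.fit_transform(docs)

# Reproduce sklearn's smoothed idf by hand: idf(t) = ln((1 + n) / (1 + df(t))) + 1
n = len(docs)
df = np.asarray((X > 0).sum(axis=0)).ravel()  # document frequency of each term
idf_manual = np.log((1 + n) / (1 + df)) + 1

print(np.allclose(vec.idf_, idf_manual))  # True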

What is from Sklearn Feature_extraction text import TfidfVectorizer?

sklearn.feature_extraction.text.TfidfVectorizer converts a collection of raw documents to a matrix of TF-IDF features.

What is the difference between TfidfVectorizer and Tfidftransformer?

With TfidfTransformer you first compute word counts using CountVectorizer, then compute the inverse document frequency (IDF) values, and only then compute the tf-idf scores. With TfidfVectorizer, by contrast, you do all three steps at once.
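
A small sketch of that equivalence (assuming default parameters on both routes, with an illustrative docs list):

from sklearn.feature_extraction.text import (CountVectorizer, TfidfTransformer,
                                              TfidfVectorizer)

docs = ["football is the sport", "gravity is the movie"]

# Two-step route: raw counts first, then IDF weighting.
counts = CountVectorizer().fit_transform(docs)
tfidf_two_step = TfidfTransformer().fit_transform(counts)

# One-step route: TfidfVectorizer does the counting and the weighting together.
tfidf_one_step = TfidfVectorizer().fit_transform(docs)

print(abs(tfidf_two_step - tfidf_one_step).max())  # 0.0: the two routes give identical results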


2 Answers

In sklearn terminology, this is called a partial fit and you can't do it with a TfidfVectorizer. There are two ways around this:

  • Concatenate the two training sets and re-vectorize
  • Use a HashingVectorizer, which supports partial fitting (see the sketch after the example output below). However, it does not have a get_feature_names method, because it hashes features and the original strings aren't kept. Another advantage is that it is much more memory efficient.

Example of the first approach:

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

train_data1 = ["football is the sport", "gravity is the movie", "education is important"]
vectorizer = TfidfVectorizer(stop_words='english')

print("Applying first train data")
X_train = vectorizer.fit_transform(train_data1)
print(vectorizer.get_feature_names())

print("\n\nApplying second train data")
train_data2 = ["cricket", "Transformers is a film", "AIMS is a college"]
X_train = vectorizer.transform(train_data2)
print(vectorizer.get_feature_names())

print("\n\nApplying fit transform onto second train data")
X_train = vectorizer.fit_transform(train_data1 + train_data2)
print(vectorizer.get_feature_names())

Output:

Applying first train data
['education', 'football', 'gravity', 'important', 'movie', 'sport']

Applying second train data
['education', 'football', 'gravity', 'important', 'movie', 'sport']

Applying fit transform onto second train data
['aims', 'college', 'cricket', 'education', 'film', 'football', 'gravity', 'important', 'movie', 'sport', 'transformers']
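
And a minimal sketch of the second approach (HashingVectorizer; the n_features value is just an illustrative choice):

from sklearn.feature_extraction.text import HashingVectorizer

# The hasher is stateless, so there is nothing to fit: each batch of documents
# can be transformed on its own and the output dimensionality never changes.
vectorizer = HashingVectorizer(stop_words='english', n_features=2**10)

X1 = vectorizer.transform(["football is the sport", "gravity is the movie"])
X2 = vectorizer.transform(["Transformers is a film", "AIMS is a college"])

print(X1.shape, X2.shape)  # both (2, 1024): new terms never change the shape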

answered Oct 17 '22 by mbatchkarov


I found this question while googling for the same issue the OP raised. As mbatchkarov said, Scikit-Learn's TfidfVectorizer doesn't natively support partial fitting.

HashingVectorizer is usually a great alternative, but it really depends on your use-case. Specifically, if you care very much about representing infrequent terms precisely, then collisions will hurt performance.

So I went ahead and wrote my own implementation of "partial_fit" for both TfidfVectorizer and CountVectorizer (see here). I hope it's useful for other people reaching this post. Note that this kind of partial fitting does change the dimension of the output vector given by the vectorizer, since the whole point is to update the vocabulary, so take this into account when using it in a pipeline.
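
The linked implementation isn't reproduced here, but as a rough sketch of the idea for the simpler CountVectorizer case (the partial_fit helper below is hypothetical and only grows the vocabulary; a TfidfVectorizer version would additionally have to update the stored idf statistics):

from sklearn.feature_extraction.text import CountVectorizer

def partial_fit(self, new_docs):
    # Add any previously unseen tokens to the existing vocabulary_.
    # This changes the output dimensionality, so anything downstream in a
    # pipeline has to be re-fit afterwards.
    analyzer = self.build_analyzer()  # the preprocessing/tokenization the vectorizer actually uses
    next_index = len(self.vocabulary_)
    for doc in new_docs:
        for token in analyzer(doc):
            if token not in self.vocabulary_:
                self.vocabulary_[token] = next_index
                next_index += 1
    return self

# Monkey-patch the helper onto CountVectorizer (for illustration only)
CountVectorizer.partial_fit = partial_fit

vectorizer = CountVectorizer(stop_words='english')
vectorizer.fit(["football is the sport", "gravity is the movie"])
vectorizer.partial_fit(["AIMS is a college"])
print(vectorizer.get_feature_names())  # now also includes 'aims' and 'college'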

answered Oct 17 '22 by Ido S