Keep TFIDF result for predicting new content using Scikit for Python

I am using scikit-learn in Python to do some clustering. I've trained on 200,000 samples, and the code below works well.

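from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = open("token_from_xml.txt")  # one document per line
vectorizer = CountVectorizer(decode_error="replace")
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))
km = KMeans(30)  # 30 clusters
kmresult = km.fit(tfidf).predict(tfidf)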

But when I have new testing content, I'd like to assign it to the existing clusters I've trained. So I'm wondering how to save the IDF result, so that I can compute TF-IDF for the new testing content and make sure the result for the new content has the same array length.

Thanks in advance.

UPDATE

I may need to save the "transformer" or "tfidf" variable to a file (txt or another format), if one of them contains the trained IDF result.

UPDATE

For example, I have the training data:

["a", "b", "c"]
["a", "b", "d"]

After TF-IDF, the result will contain 4 features (a, b, c, d).

When I test:

["a", "c", "d"]

to see which cluster (already made by k-means) it belongs to, TF-IDF will only give a result with 3 features (a, c, d), so the clustering in k-means will fail. (If I test ["a", "b", "e"], there may be other problems.)

So how do I store the feature list for the testing data (and, even better, store it in a file)?
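
The mismatch described above can be reproduced with a small sketch. It assumes three-letter tokens (aaa, bbb, ...) instead of single letters, because CountVectorizer's default token pattern skips one-character tokens; the variable names are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["aaa bbb ccc", "aaa bbb ddd"]
test_docs = ["aaa ccc ddd"]

# Fit on the training documents: 4 distinct tokens, so 4 columns.
train_vec = CountVectorizer()
X_train = train_vec.fit_transform(train_docs)

# Fitting a fresh vectorizer on the test document learns only 3 tokens.
refit_vec = CountVectorizer()
X_refit = refit_vec.fit_transform(test_docs)

# Reusing the fitted vectorizer with transform() keeps the 4 training columns.
X_test = train_vec.transform(test_docs)

print(X_train.shape)  # (2, 4)
print(X_refit.shape)  # (1, 3)
print(X_test.shape)   # (1, 4)
```

So the column count only changes when a new vectorizer is fitted on the test data; what needs to be kept around is the fitted vectorizer (or at least its vocabulary).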

UPDATE

Solved, see answers below.

285

asked Apr 22 '15 04:04

lol.Wen

3 Answers

I successfully saved the feature list by saving vectorizer.vocabulary_, and reused it with CountVectorizer(decode_error="replace", vocabulary=vectorizer.vocabulary_).

Code below:

import pickle
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = np.array(["aaa bbb ccc", "aaa bbb ddd"])
vectorizer = CountVectorizer(decode_error="replace")
vec_train = vectorizer.fit_transform(corpus)
# Save vectorizer.vocabulary_
pickle.dump(vectorizer.vocabulary_, open("feature.pkl", "wb"))

# Load it later
transformer = TfidfTransformer()
loaded_vec = CountVectorizer(decode_error="replace", vocabulary=pickle.load(open("feature.pkl", "rb")))
# The fixed vocabulary keeps the training columns; the unseen token "eee" is ignored
tfidf = transformer.fit_transform(loaded_vec.fit_transform(np.array(["aaa ccc eee"])))

That works. tfidf will have the same number of features as the training data.
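
One caveat: calling fit_transform on the new text recomputes the IDF weights from the new text alone. If the trained IDF weights should be kept as well, one option is to pickle the fitted objects themselves and call transform (not fit_transform) after loading. A minimal sketch, assuming a toy corpus, a temporary file path, and 2 clusters, none of which come from the original post:

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy training corpus (assumption, stands in for the real 200,000 documents).
train_docs = ["aaa bbb ccc", "aaa bbb ddd", "eee fff ggg", "eee fff hhh"]

vectorizer = CountVectorizer(decode_error="replace")
transformer = TfidfTransformer()
tfidf_train = transformer.fit_transform(vectorizer.fit_transform(train_docs))
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(tfidf_train)

# Persist the fitted vectorizer, IDF transformer and clustering model together.
model_path = os.path.join(tempfile.mkdtemp(), "tfidf_kmeans.pkl")
with open(model_path, "wb") as handle:
    pickle.dump((vectorizer, transformer, km), handle)

# Later: load everything and only transform the new content.
with open(model_path, "rb") as handle:
    loaded_vec, loaded_tfidf, loaded_km = pickle.load(handle)

new_docs = ["aaa ccc zzz"]  # "zzz" was never seen in training and is ignored
tfidf_new = loaded_tfidf.transform(loaded_vec.transform(new_docs))
labels = loaded_km.predict(tfidf_new)

print(tfidf_new.shape[1] == tfidf_train.shape[1])
print(np.allclose(loaded_tfidf.idf_, transformer.idf_))
```

Both checks should print True: the new matrix has the training columns, and the loaded transformer carries the IDF weights learned during training.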

138

answered Sep 24 '22 06:09

lol.Wen

Instead of using CountVectorizer to store the vocabulary, the vocabulary of the TfidfVectorizer can be used directly.

Training phase:

import pickle
from sklearn.feature_extraction.text import TfidfVectorizer

# tf-idf based vectors
tf = TfidfVectorizer(analyzer='word', ngram_range=(1,2), stop_words = "english", lowercase = True, max_features = 500000)

# Fit the model
tf_transformer = tf.fit(corpus)

# Dump the file
pickle.dump(tf_transformer, open("tfidf1.pkl", "wb"))


# Testing phase
tf1 = pickle.load(open("tfidf1.pkl", 'rb'))

# Create new tfidfVectorizer with old vocabulary
tf1_new = TfidfVectorizer(analyzer='word', ngram_range=(1,2), stop_words = "english", lowercase = True,
                          max_features = 500000, vocabulary = tf1.vocabulary_)
X_tf1 = tf1_new.fit_transform(new_corpus)

fit_transform works here because we pass in the old vocabulary, so the new matrix has the same columns as the training matrix. If you had not stored the vectorizer, you would simply have called transform on the test data; even then, the new test documents are mapped onto the vocabulary of the vectorizer fitted on the training data, which is exactly what we are doing here. Note that fit_transform recomputes the IDF weights from new_corpus; since the whole fitted vectorizer was pickled, calling tf1.transform(new_corpus) would reuse the training IDF weights as well as the vocabulary.
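
To see what each route actually keeps, here is a small sketch that compares re-fitting with the old vocabulary against calling transform on the loaded vectorizer. The toy documents and variable names are assumptions, and an in-memory pickle round trip stands in for the file.

```python
import pickle

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["aaa bbb ccc", "aaa bbb ddd", "aaa eee fff"]
new_docs = ["ccc ddd", "ccc eee"]

# Fit on the training data, then simulate saving and loading.
tf = TfidfVectorizer().fit(train_docs)
tf_loaded = pickle.loads(pickle.dumps(tf))

# Route 1: transform with the loaded vectorizer (vocabulary and IDF reused).
X_reused = tf_loaded.transform(new_docs)

# Route 2: new vectorizer with the old vocabulary, fitted on the new data.
tf_refit = TfidfVectorizer(vocabulary=tf_loaded.vocabulary_)
X_refit = tf_refit.fit_transform(new_docs)

same_columns = X_reused.shape == X_refit.shape
same_idf = np.allclose(tf_loaded.idf_, tf_refit.idf_)

print(same_columns)  # True
print(same_idf)      # False
```

Both routes produce the same number of columns, but only the first one keeps the IDF weights learned from the training corpus.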

answered Sep 24 '22 06:09

Arjun Mishra

If you want to store the feature matrix for future use, you can do this:

import pickle

tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))

# store the content
with open("x_result.pkl", "wb") as handle:
    pickle.dump(tfidf, handle)
# load the content
with open("x_result.pkl", "rb") as handle:
    tfidf = pickle.load(handle)
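
Keep in mind that this stores only the TF-IDF matrix of the training documents, not the vocabulary or IDF weights needed to transform new text. Since the matrix is a SciPy sparse matrix, scipy.sparse.save_npz and load_npz are an alternative to pickle. A rough sketch, assuming a toy corpus and a temporary file path:

```python
import os
import tempfile

import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["aaa bbb ccc", "aaa bbb ddd"]  # toy corpus (assumption)
tfidf = TfidfVectorizer().fit_transform(train_docs)

# Save the sparse matrix in SciPy's .npz format and load it back.
matrix_path = os.path.join(tempfile.mkdtemp(), "x_result.npz")
sp.save_npz(matrix_path, tfidf)
restored = sp.load_npz(matrix_path)

# The restored matrix has the same shape and values as the original.
round_trip_ok = (
    restored.shape == tfidf.shape
    and np.allclose(restored.toarray(), tfidf.toarray())
)
print(round_trip_ok)
```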

answered Sep 20 '22 06:09

user123


Tags: python, machine-learning, scikit-learn, tf-idf
