Keep TF-IDF result for predicting new content using scikit-learn in Python

I am using sklearn in Python to do some clustering. I've trained the model on 200,000 documents, and the code below works well.

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.cluster import KMeans

corpus = open("token_from_xml.txt")  # one document per line
vectorizer = CountVectorizer(decode_error="replace")
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))
km = KMeans(30)
kmresult = km.fit(tfidf).predict(tfidf)

But when I have new test content, I'd like to assign it to the existing clusters I've trained. So I'm wondering how to save the IDF result, so that I can compute TF-IDF for the new test content and make sure the resulting vector has the same length as the training vectors.

Thanks in advance.

UPDATE

I may need to save the "transformer" or "tfidf" variable to a file (txt or other), if one of them contains the trained IDF result.
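For reference, after fitting, the TfidfTransformer is the object that holds the learned IDF weights (in its idf_ attribute), so pickling it is one option. A minimal sketch, assuming "transformer" is the fitted TfidfTransformer from the code above:

import pickle

# Assumption: transformer is already fitted; it carries the trained IDF
# weights in transformer.idf_
with open("transformer.pkl", "wb") as f:
    pickle.dump(transformer, f)

# Later: reload it and apply the trained IDF to new counts without re-fitting
with open("transformer.pkl", "rb") as f:
    transformer = pickle.load(f)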

UPDATE

For example, I have the training data:

["a", "b", "c"]
["a", "b", "d"]

After doing TF-IDF, the result will contain 4 features (a, b, c, d).

When I TEST:

["a", "c", "d"]

to see which cluster (already built by k-means) it belongs to. TF-IDF will only give a result with 3 features (a, c, d), so the clustering in k-means will fail. (If I test ["a", "b", "e"], there may be other problems.)

So how do I store the feature list so it can be reused for test data (and, ideally, store it in a file)? A sketch of the mismatch is shown below.
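To make the mismatch concrete, here is a minimal sketch (the token_pattern override is only needed because scikit-learn's default tokenizer drops single-character tokens):

from sklearn.feature_extraction.text import TfidfVectorizer

train = ["a b c", "a b d"]
test = ["a c d"]

# Fitting a fresh vectorizer on each set produces different feature spaces
vec_train = TfidfVectorizer(token_pattern=r"\w+")
print(vec_train.fit_transform(train).shape)  # (2, 4) -> features a, b, c, d

vec_test = TfidfVectorizer(token_pattern=r"\w+")
print(vec_test.fit_transform(test).shape)    # (1, 3) -> features a, c, d only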

UPDATE

Solved, see answers below.

asked Apr 22 '15 by lol.Wen

3 Answers

I successfully saved the feature list by saving vectorizer.vocabulary_, and reused it with CountVectorizer(decode_error="replace", vocabulary=vectorizer.vocabulary_).

Code below:

import pickle
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = np.array(["aaa bbb ccc", "aaa bbb ddd"])
vectorizer = CountVectorizer(decode_error="replace")
vec_train = vectorizer.fit_transform(corpus)

# Save vectorizer.vocabulary_
pickle.dump(vectorizer.vocabulary_, open("feature.pkl", "wb"))

# Load it later and vectorize new content against the saved vocabulary
transformer = TfidfTransformer()
loaded_vec = CountVectorizer(decode_error="replace", vocabulary=pickle.load(open("feature.pkl", "rb")))
tfidf = transformer.fit_transform(loaded_vec.fit_transform(np.array(["aaa ccc eee"])))

That works: tfidf will have the same feature length as the training data. Note that fit_transform here re-fits the TfidfTransformer's IDF on the new content; to reuse the trained IDF weights as well, the fitted transformer itself also needs to be persisted.
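A sketch that combines both saved pieces, assuming the fitted transformer was also pickled (e.g. to a hypothetical "transformer.pkl", as in the question's sketch above):

import pickle
from sklearn.feature_extraction.text import CountVectorizer

# Rebuild the vectorizer from the saved vocabulary and reload the fitted
# transformer, then transform (not fit) so the trained IDF weights are kept
loaded_vec = CountVectorizer(decode_error="replace",
                             vocabulary=pickle.load(open("feature.pkl", "rb")))
transformer = pickle.load(open("transformer.pkl", "rb"))
tfidf = transformer.transform(loaded_vec.transform(["aaa ccc eee"]))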

answered by lol.Wen


Instead of using the CountVectorizer for storing the vocabulary, the vocabulary of the TfidfVectorizer can be used directly.

Training phase:

import pickle
from sklearn.feature_extraction.text import TfidfVectorizer

# tf-idf based vectors
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), stop_words="english",
                     lowercase=True, max_features=500000)

# Fit the model
tf_transformer = tf.fit(corpus)

# Pickle the fitted vectorizer
pickle.dump(tf_transformer, open("tfidf1.pkl", "wb"))


# Testing phase
tf1 = pickle.load(open("tfidf1.pkl", "rb"))

# Create a new TfidfVectorizer with the old vocabulary
tf1_new = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), stop_words="english",
                          lowercase=True, max_features=500000,
                          vocabulary=tf1.vocabulary_)
X_tf1 = tf1_new.fit_transform(new_corpus)

The fit_transform works here because we are using the old vocabulary. If you were not storing the vocabulary, you would have just used transform on the test data. Even when you do a transform there, the new documents from the test data are being "fitted" to the vocabulary of the vectorizer from training. That is exactly what we are doing here. The key piece to store and re-use for a tfidf vectorizer is the vocabulary.
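Since the pickle above stores the entire fitted vectorizer, not just its vocabulary, a direct transform is also possible; a minimal sketch:

# tf1 is the fitted vectorizer loaded from disk, so transform can be called
# directly; this also reuses the trained IDF weights instead of re-fitting
X_tf1_direct = tf1.transform(new_corpus)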

answered by Arjun Mishra


If you want to store the computed tf-idf result for future use, you can pickle it directly:

import pickle

tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))

# Store the content
with open("x_result.pkl", "wb") as handle:
    pickle.dump(tfidf, handle)

# Load the content
tfidf = pickle.load(open("x_result.pkl", "rb"))
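Tying this back to the original question, a sketch of assigning new content to the trained clusters; it assumes the trained km model from the question plus the reloaded, fitted loaded_vec and transformer from the sketches above:

# Vectorize new text in the training feature space, then ask the trained
# k-means model which cluster it belongs to
new_tfidf = transformer.transform(loaded_vec.transform(["aaa ccc eee"]))
cluster_id = km.predict(new_tfidf)[0]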
answered by user123