Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

tfidf algorithm for python

I have this code for calculating text similarity with tf-idf.

from sklearn.feature_extraction.text import TfidfVectorizer

documents = [doc1,doc2]
tfidf = TfidfVectorizer().fit_transform(documents)
pairwise_similarity = tfidf * tfidf.T
print pairwise_similarity.A

The problem is that this code take as an input plain strings and I want to prepare the documents by removing stopwords, stemming and tokkenize. So the input would be a list. The error if I call the documents = [doc1,doc2] with the tokkenized documents is:

    Traceback (most recent call last):
  File "C:\Users\tasos\Desktop\my thesis\beta\similarity.py", line 18, in <module>
    tfidf = TfidfVectorizer().fit_transform(documents)
  File "C:\Python27\lib\site-packages\scikit_learn-0.14.1-py2.7-win32.egg\sklearn\feature_extraction\text.py", line 1219, in fit_transform
    X = super(TfidfVectorizer, self).fit_transform(raw_documents)
  File "C:\Python27\lib\site-packages\scikit_learn-0.14.1-py2.7-win32.egg\sklearn\feature_extraction\text.py", line 780, in fit_transform
    vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary)
  File "C:\Python27\lib\site-packages\scikit_learn-0.14.1-py2.7-win32.egg\sklearn\feature_extraction\text.py", line 715, in _count_vocab
    for feature in analyze(doc):
  File "C:\Python27\lib\site-packages\scikit_learn-0.14.1-py2.7-win32.egg\sklearn\feature_extraction\text.py", line 229, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "C:\Python27\lib\site-packages\scikit_learn-0.14.1-py2.7-win32.egg\sklearn\feature_extraction\text.py", line 195, in <lambda>
    return lambda x: strip_accents(x.lower())
AttributeError: 'unicode' object has no attribute 'apply_freq_filter'

Is there any way to change the code and make it accept list or have I to change the tokkenized documents to strings again?

like image 748
Tasos Avatar asked Aug 25 '13 18:08

Tasos


1 Answers

Try skipping preprocessing to lowercase and supply your own "nop" tokenizer:

tfidf = TfidfVectorizer(tokenizer=lambda doc: doc, lowercase=False).fit_transform(documents)

You should also check out other parameters like stop_words to avoid duplicating your preprocessing.

like image 193
chlunde Avatar answered Sep 29 '22 15:09

chlunde