I'm working on a corpus of ~100k research papers. I'm considering three fields: plaintext, title, and abstract.
I used the TfidfVectorizer to get a tf-idf representation of the plaintext field and fed the resulting vocab back into the vectorizers for title and abstract to ensure that all three representations work on the same vocab. My idea was that since the plaintext field is much bigger than the other two, its vocab will most probably cover all the words in the other fields. But how would the TfidfVectorizer deal with new words/tokens if that weren't the case?
Here's an example of my code:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=2)
plaintexts_tfidf = vectorizer.fit_transform(plaintexts)
vocab = vectorizer.vocabulary_

# later, in another script, after loading the vocab from disk
vectorizer = TfidfVectorizer(min_df=2, vocabulary=vocab)
titles_tfidf = vectorizer.fit_transform(titles)
The vocab has ~900k words.
During vectorization I didn't run into any problems, but later, when I wanted to compare the similarity between the vectorized titles using sklearn.metrics.pairwise.cosine_similarity, I ran into this error:
>> titles_sim = cosine_similarity(titles_tfidf)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-237-5aa86fe892da> in <module>()
----> 1 titles_sim = cosine_similarity(titles_tfidf)
/usr/local/lib/python3.5/dist-packages/sklearn/metrics/pairwise.py in cosine_similarity(X, Y, dense_output)
916 Y_normalized = normalize(Y, copy=True)
917
--> 918 K = safe_sparse_dot(X_normalized, Y_normalized.T, dense_output=dense_output)
919
920 return K
/usr/local/lib/python3.5/dist-packages/sklearn/utils/extmath.py in safe_sparse_dot(a, b, dense_output)
184 ret = a * b
185 if dense_output and hasattr(ret, "toarray"):
--> 186 ret = ret.toarray()
187 return ret
188 else:
/usr/local/lib/python3.5/dist-packages/scipy/sparse/compressed.py in toarray(self, order, out)
918 def toarray(self, order=None, out=None):
919 """See the docstring for `spmatrix.toarray`."""
--> 920 return self.tocoo(copy=False).toarray(order=order, out=out)
921
922 ##############################################################
/usr/local/lib/python3.5/dist-packages/scipy/sparse/coo.py in toarray(self, order, out)
256 M,N = self.shape
257 coo_todense(M, N, self.nnz, self.row, self.col, self.data,
--> 258 B.ravel('A'), fortran)
259 return B
260
ValueError: could not convert integer scalar
I'm not sure whether this is related, but I can't see what's going wrong here, especially since I don't run into the error when calculating the similarities on the plaintext vectors.
Am I missing something? Is there a better way to use the vectorizer?
Edit:
The shapes of the sparse CSR matrices are equal:
>> titles_tfidf.shape
(96582, 852885)
>> plaintexts_tfidf.shape
(96582, 852885)
TF-IDF vectorization involves calculating the TF-IDF score for every word in a document relative to the corpus and then putting that information into a vector.
The TF-IDF score measures how distinctive a word is by comparing the number of times it appears in a document with the number of documents it appears in. The formula is: TF-IDF(t, d) = TF(t, d) x IDF(t), where TF(t, d) is the number of times term "t" appears in document "d", and IDF(t) = log(N / DF(t)), with N the total number of documents and DF(t) the number of documents containing "t".
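As a rough sketch of that formula in plain Python (note that sklearn's TfidfVectorizer additionally smooths the IDF and l2-normalizes each row, so its numbers will differ slightly):

import math

docs = [["plain", "texts", "texts"], ["plain", "here"]]

def tf(t, d):
    # raw count of term t in document d
    return d.count(t)

def idf(t, docs):
    # log(total documents / documents containing t)
    df = sum(1 for d in docs if t in d)
    return math.log(len(docs) / df)

print(tf("texts", docs[0]) * idf("texts", docs))  # 2 * log(2/1) ~ 1.386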
Given how the TF-IDF score is set up, removing stopwords shouldn't make a significant difference. The whole point of the IDF is to down-weight words that carry little semantic value across the corpus, so if you leave the stopwords in, the IDF should largely suppress them.
TF-IDF improves on a plain CountVectorizer because it not only counts how often words occur in the corpus but also weights them by how informative they are. We can then drop the words that matter least for the analysis, which reduces the input dimensionality and keeps the model simpler.
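A quick way to see the difference on a toy corpus (a made-up example, not from the question):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat", "the the the dog"]

# Raw counts reward frequent words like "the"...
print(CountVectorizer().fit_transform(docs).toarray())

# ...while tf-idf down-weights words that occur in every document.
print(TfidfVectorizer().fit_transform(docs).toarray().round(2))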
I'm afraid the result matrix might be too large: as a dense array it would have 96582*96582 = 9328082724 cells. Try slicing titles_tfidf a bit and check (see the sketch below).
Source: http://scipy-user.10969.n7.nabble.com/SciPy-User-strange-error-when-creating-csr-matrix-td20129.html
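For instance, something like this (the 1000 is arbitrary; dense_output=False is a standard cosine_similarity parameter that keeps the result sparse):

from sklearn.metrics.pairwise import cosine_similarity

# Compare a small slice against the full set; the result is
# 1000 x 96582 instead of 96582 x 96582.
titles_sim_chunk = cosine_similarity(titles_tfidf[:1000], titles_tfidf)

# Or keep the result sparse instead of densifying it (it can
# still be large if many title pairs share terms).
titles_sim_sparse = cosine_similarity(titles_tfidf, dense_output=False)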
EDIT: If you are using an older SciPy/NumPy version, you might want to update: https://github.com/scipy/scipy/pull/4678
EDIT 2: Also, if you are using 32-bit Python, switching to 64-bit might help (I suppose).
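You can check both from within Python:

import sys
import scipy

print(scipy.__version__)    # compare against the release containing the PR above
print(sys.maxsize > 2**32)  # True on a 64-bit build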
EDIT 3: Answering your original question: when you use the vocabulary from plaintexts and there are new words in titles, they will be ignored, but they will not influence the tf-idf values of the other words. I hope this snippet makes it more understandable:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

plaintexts = ["They are", "plain texts texts amoersand here"]
titles = ["And here", "titles ", "wolf dog eagle", "But here plain"]

# Fit on plaintexts and keep the learned vocabulary.
vectorizer = TfidfVectorizer()
plaintexts_tfidf = vectorizer.fit_transform(plaintexts)
vocab = vectorizer.vocabulary_

# Refit on titles, constrained to the plaintext vocabulary:
# words not in vocab are silently ignored.
vectorizer = TfidfVectorizer(vocabulary=vocab)
titles_tfidf = vectorizer.fit_transform(titles)
print('values using vocabulary')
print(titles_tfidf)
print(vectorizer.get_feature_names())

# For comparison: a fresh vectorizer that builds its own vocabulary.
print('Brand new vectorizer')
vectorizer = TfidfVectorizer()
titles_tfidf = vectorizer.fit_transform(titles)
print(titles_tfidf)
print(vectorizer.get_feature_names())
Result is:
values using vocabulary
(0, 2) 1.0
(3, 3) 0.78528827571
(3, 2) 0.61913029649
['amoersand', 'are', 'here', 'plain', 'texts', 'they']
Brand new vectorizer
(0, 0) 0.78528827571
(0, 4) 0.61913029649
(1, 6) 1.0
(2, 7) 0.57735026919
(2, 2) 0.57735026919
(2, 3) 0.57735026919
(3, 4) 0.486934264074
(3, 1) 0.617614370976
(3, 5) 0.617614370976
['and', 'but', 'dog', 'eagle', 'here', 'plain', 'titles', 'wolf']
Notice that this is not the same as simply removing the words that do not occur in plaintexts from titles before vectorizing.
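One more hedged side note: fitting with vocabulary=vocab still recomputes the idf statistics from titles. If you want to reuse the idf weights learned from plaintexts as well, keep the fitted vectorizer and call transform() instead of refitting (reusing plaintexts and titles from the snippet above):

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
plaintexts_tfidf = vectorizer.fit_transform(plaintexts)

# transform() reuses both the vocabulary and the idf weights learned
# from plaintexts; words unseen during fit are ignored.
titles_tfidf = vectorizer.transform(titles)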