I'm working on a corpus of ~100k research papers. I'm considering three fields: plaintext, title, and abstract.
I used the TfidfVectorizer to get a tf-idf representation of the plaintext field and fed the resulting vocab back into the vectorizers for title and abstract to ensure that all three representations work on the same vocab. My idea was that since the plaintext field is much bigger than the other two, its vocab will most probably cover all the words in the other fields. But how would the TfidfVectorizer deal with new words/tokens if that weren't the case?
Here's an example of my code:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=2)
plaintexts_tfidf = vectorizer.fit_transform(plaintexts)
vocab = vectorizer.vocabulary_

# later, in another script, after loading the vocab from disk
vectorizer = TfidfVectorizer(min_df=2, vocabulary=vocab)
titles_tfidf = vectorizer.fit_transform(titles)
The vocab has ~900k words.
During vectorization I didn't run into any problems, but later, when I wanted to compare the similarity between the vectorized titles using sklearn.metrics.pairwise.cosine_similarity, I ran into this error:
>> titles_sim = cosine_similarity(titles_tfidf)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-237-5aa86fe892da> in <module>()
----> 1 titles_sim = cosine_similarity(titles_tfidf)
/usr/local/lib/python3.5/dist-packages/sklearn/metrics/pairwise.py in cosine_similarity(X, Y, dense_output)
916 Y_normalized = normalize(Y, copy=True)
917
--> 918 K = safe_sparse_dot(X_normalized, Y_normalized.T, dense_output=dense_output)
919
920 return K
/usr/local/lib/python3.5/dist-packages/sklearn/utils/extmath.py in safe_sparse_dot(a, b, dense_output)
184 ret = a * b
185 if dense_output and hasattr(ret, "toarray"):
--> 186 ret = ret.toarray()
187 return ret
188 else:
/usr/local/lib/python3.5/dist-packages/scipy/sparse/compressed.py in toarray(self, order, out)
918 def toarray(self, order=None, out=None):
919 """See the docstring for `spmatrix.toarray`."""
--> 920 return self.tocoo(copy=False).toarray(order=order, out=out)
921
922 ##############################################################
/usr/local/lib/python3.5/dist-packages/scipy/sparse/coo.py in toarray(self, order, out)
256 M,N = self.shape
257 coo_todense(M, N, self.nnz, self.row, self.col, self.data,
--> 258 B.ravel('A'), fortran)
259 return B
260
ValueError: could not convert integer scalar
I'm not sure whether this is related, but I can't see what's going wrong here, especially since I don't run into the error when calculating the similarities on the plaintext vectors.
Am I missing something? Is there a better way to use the vectorizer?
Edit:
The shapes of the sparse CSR matrices are equal:
>> titles_tfidf.shape
(96582, 852885)
>> plaintexts_tfidf.shape
(96582, 852885)
TF-IDF vectorization involves calculating the TF-IDF score for every word in a document relative to the corpus and then putting that information into a vector.
The TF-IDF score measures how distinctive a word is by comparing the number of times it appears in a document with the number of documents it appears in. The formula is: TF-IDF(t, d) = TF(t, d) x IDF(t), where TF(t, d) is the number of times term "t" appears in document "d", and IDF(t) = log(N / DF(t)), with N the total number of documents and DF(t) the number of documents containing "t".
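As a rough sketch of that formula in plain Python (note that sklearn's TfidfVectorizer additionally smooths the IDF and l2-normalizes each row, so its numbers will differ slightly):

import math

docs = [["plain", "texts", "texts"], ["plain", "here"]]

def tf(t, d):
    # raw count of term t in document d
    return d.count(t)

def idf(t, docs):
    # log(total documents / documents containing t)
    df = sum(1 for d in docs if t in d)
    return math.log(len(docs) / df)

print(tf("texts", docs[0]) * idf("texts", docs))  # 2 * log(2/1) ~ 1.386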
Given how the TF-IDF score is set up, removing stopwords shouldn't make a significant difference. The whole point of the IDF is to down-weight words that carry little semantic value across the corpus, so if you leave the stopwords in, the IDF should largely suppress them.
TF-IDF improves on a plain CountVectorizer because it not only counts how often words occur in the corpus but also weights them by how informative they are. We can then drop the words that matter least for the analysis, which reduces the input dimensionality and keeps the model simpler.
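A quick way to see the difference on a toy corpus (a made-up example, not from the question):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat", "the the the dog"]

# Raw counts reward frequent words like "the"...
print(CountVectorizer().fit_transform(docs).toarray())

# ...while tf-idf down-weights words that occur in every document.
print(TfidfVectorizer().fit_transform(docs).toarray().round(2))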
I'm afraid the result matrix might be too large: as a dense array it would have 96582*96582 = 9328082724 cells. Try slicing titles_tfidf a bit and check (see the sketch below).
Source: http://scipy-user.10969.n7.nabble.com/SciPy-User-strange-error-when-creating-csr-matrix-td20129.html
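For instance, something like this (the 1000 is arbitrary; dense_output=False is a standard cosine_similarity parameter that keeps the result sparse):

from sklearn.metrics.pairwise import cosine_similarity

# Compare a small slice against the full set; the result is
# 1000 x 96582 instead of 96582 x 96582.
titles_sim_chunk = cosine_similarity(titles_tfidf[:1000], titles_tfidf)

# Or keep the result sparse instead of densifying it (it can
# still be large if many title pairs share terms).
titles_sim_sparse = cosine_similarity(titles_tfidf, dense_output=False)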
EDIT: If you are using an older SciPy/NumPy version, you might want to update: https://github.com/scipy/scipy/pull/4678
EDIT 2: Also, if you are using 32-bit Python, switching to 64-bit might help (I suppose).
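You can check both from within Python:

import sys
import scipy

print(scipy.__version__)    # compare against the release containing the PR above
print(sys.maxsize > 2**32)  # True on a 64-bit build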
EDIT 3: Answering your original question: when you use the vocabulary from plaintexts and there are new words in titles, they will be ignored, but they will not influence the tf-idf values of the other words. I hope this snippet makes it more understandable:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

plaintexts = ["They are", "plain texts texts amoersand here"]
titles = ["And here", "titles ", "wolf dog eagle", "But here plain"]

# Fit on plaintexts and keep the learned vocabulary.
vectorizer = TfidfVectorizer()
plaintexts_tfidf = vectorizer.fit_transform(plaintexts)
vocab = vectorizer.vocabulary_

# Refit on titles, constrained to the plaintext vocabulary:
# words not in vocab are silently ignored.
vectorizer = TfidfVectorizer(vocabulary=vocab)
titles_tfidf = vectorizer.fit_transform(titles)
print('values using vocabulary')
print(titles_tfidf)
print(vectorizer.get_feature_names())

# For comparison: a fresh vectorizer that builds its own vocabulary.
print('Brand new vectorizer')
vectorizer = TfidfVectorizer()
titles_tfidf = vectorizer.fit_transform(titles)
print(titles_tfidf)
print(vectorizer.get_feature_names())
Result is:
values using vocabulary
(0, 2) 1.0
(3, 3) 0.78528827571
(3, 2) 0.61913029649
['amoersand', 'are', 'here', 'plain', 'texts', 'they']
Brand new vectorizer
(0, 0) 0.78528827571
(0, 4) 0.61913029649
(1, 6) 1.0
(2, 7) 0.57735026919
(2, 2) 0.57735026919
(2, 3) 0.57735026919
(3, 4) 0.486934264074
(3, 1) 0.617614370976
(3, 5) 0.617614370976
['and', 'but', 'dog', 'eagle', 'here', 'plain', 'titles', 'wolf']
Notice that this is not the same as simply removing the words that do not occur in plaintexts from titles before vectorizing.
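One more hedged side note: fitting with vocabulary=vocab still recomputes the idf statistics from titles. If you want to reuse the idf weights learned from plaintexts as well, keep the fitted vectorizer and call transform() instead of refitting (reusing plaintexts and titles from the snippet above):

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
plaintexts_tfidf = vectorizer.fit_transform(plaintexts)

# transform() reuses both the vocabulary and the idf weights learned
# from plaintexts; words unseen during fit are ignored.
titles_tfidf = vectorizer.transform(titles)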