I am trying to write a script that calculates the similarity of a few documents, and I want to do it using LSA. I found the following code and changed it a bit. It takes 3 documents as input and outputs a 3x3 matrix with the similarities between them. I want to do the same similarity calculation using only the sklearn library. Is that possible?
from numpy import zeros
from scipy.linalg import svd
from math import log
from numpy import asarray, sum
from nltk.corpus import stopwords
from sklearn.metrics.pairwise import cosine_similarity

titles = [doc1,doc2,doc3]
ignorechars = ''',:'!'''

class LSA(object):
    def __init__(self, stopwords, ignorechars):
        self.stopwords = stopwords.words('english')
        self.ignorechars = ignorechars
        self.wdict = {}
        self.dcount = 0

    def parse(self, doc):
        words = doc.split()
        for w in words:
            w = w.lower()
            if w in self.stopwords:
                continue
            elif w in self.wdict:
                self.wdict[w].append(self.dcount)
            else:
                self.wdict[w] = [self.dcount]
        self.dcount += 1

    def build(self):
        self.keys = [k for k in self.wdict.keys() if len(self.wdict[k]) > 1]
        self.keys.sort()
        self.A = zeros([len(self.keys), self.dcount])
        for i, k in enumerate(self.keys):
            for d in self.wdict[k]:
                self.A[i,d] += 1

    def calc(self):
        self.U, self.S, self.Vt = svd(self.A)
        return -1*self.Vt

    def TFIDF(self):
        WordsPerDoc = sum(self.A, axis=0)
        DocsPerWord = sum(asarray(self.A > 0, 'i'), axis=1)
        rows, cols = self.A.shape
        for i in range(rows):
            for j in range(cols):
                self.A[i,j] = (self.A[i,j] / WordsPerDoc[j]) * log(float(cols) / DocsPerWord[i])

mylsa = LSA(stopwords, ignorechars)
for t in titles:
    mylsa.parse(t)
mylsa.build()
a = mylsa.calc()
cosine_similarity(a)
From @ogrisel's answer:
I ran the following code, but my mouth is still open :) Where TFIDF gives at most 80% similarity for two documents on the same subject, this code gives me 99.99%. That's why I think something is wrong :P
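from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import Normalizer

dataset = [doc1,doc2,doc3]
vectorizer = TfidfVectorizer(max_df=0.5, stop_words='english')
X = vectorizer.fit_transform(dataset)
lsa = TruncatedSVD()  # default n_components=2
X = lsa.fit_transform(X)
X = Normalizer(copy=False).fit_transform(X)
cosine_similarity(X)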
You can use the TruncatedSVD transformer from sklearn 0.14+: call fit_transform on your database of documents, then call the transform method (of the same TruncatedSVD instance) on the query document. You can then compute the cosine similarity between the transformed query document and the transformed database with sklearn.metrics.pairwise.cosine_similarity, and apply numpy.argsort to the result to find the index of the most similar document.
Note that under the hood, scikit-learn also uses NumPy but in a more efficient way than the snippet you gave (by using the Randomized SVD trick by Halko, Martinsson and Tropp).
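The steps described above could be sketched roughly as follows. This is only an illustration under some assumptions: the toy corpus, the query text, the helper name most_similar, and the choices n_components=2 and random_state=0 are all hypothetical and not taken from the question. The TF-IDF step is the same preprocessing used in the question's follow-up code.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus standing in for the "database of documents".
database = [
    "the cat sat on the mat",
    "a cat and a kitten played on the mat",
    "stock markets fell sharply on monday",
    "investors sold shares as stock markets dropped",
]
# Hypothetical query document.
query = ["a kitten sat on the mat"]


def most_similar(database, query, n_components=2, random_state=0):
    """Rank database documents by LSA cosine similarity to the query."""
    # Turn raw text into TF-IDF vectors; the vocabulary comes from the database.
    vectorizer = TfidfVectorizer(stop_words="english")
    X_db = vectorizer.fit_transform(database)

    # Fit the SVD on the database and project it into the latent space.
    lsa = TruncatedSVD(n_components=n_components, random_state=random_state)
    Z_db = lsa.fit_transform(X_db)

    # Project the query with the SAME fitted vectorizer and SVD instance.
    Z_query = lsa.transform(vectorizer.transform(query))

    # One row per query document, one column per database document.
    sims = cosine_similarity(Z_query, Z_db)

    # argsort is ascending, so reverse it to put the best match first.
    ranking = np.argsort(sims[0])[::-1]
    return sims, ranking


sims, ranking = most_similar(database, query)
print(sims)
print("most similar document index:", ranking[0])
```

For the pairwise case in the question, passing only the transformed database, as in cosine_similarity(Z_db), would give the N x N similarity matrix. Note that n_components controls how much the documents are compressed, so with very few components documents can look far more similar than they do under plain TF-IDF.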