
Use Latent Semantic Analysis with sklearn

I am trying to write a script that calculates the similarity between a few documents, and I want to do it using LSA. I found the following code and changed it a bit. It takes 3 documents as input and outputs a 3x3 matrix with the similarities between them. I want to do the same similarity calculation using only the sklearn library. Is that possible?

from math import log

from nltk.corpus import stopwords
from numpy import asarray, sum, zeros
from scipy.linalg import svd
from sklearn.metrics.pairwise import cosine_similarity

titles = [doc1,doc2,doc3]
ignorechars = ''',:'!'''

class LSA(object):
    def __init__(self, stopwords, ignorechars):
        self.stopwords = stopwords.words('english')
        self.ignorechars = ignorechars
        self.wdict = {}
        self.dcount = 0        
    def parse(self, doc):
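        words = doc.split()
        for w in words:
            # Lowercase the word and strip the characters listed in ignorechars
            w = ''.join(ch for ch in w.lower() if ch not in self.ignorechars)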
            if w in self.stopwords:
                continue
            elif w in self.wdict:
                self.wdict[w].append(self.dcount)
            else:
                self.wdict[w] = [self.dcount]
        self.dcount += 1
    def build(self):
        self.keys = [k for k in self.wdict.keys() if len(self.wdict[k]) > 1]
        self.keys.sort()
        self.A = zeros([len(self.keys), self.dcount])
        for i, k in enumerate(self.keys):
            for d in self.wdict[k]:
                self.A[i,d] += 1
    def calc(self):
        self.U, self.S, self.Vt = svd(self.A)
        return -1*self.Vt

    def TFIDF(self):
        WordsPerDoc = sum(self.A, axis=0)        
        DocsPerWord = sum(asarray(self.A > 0, 'i'), axis=1)
        rows, cols = self.A.shape
        for i in range(rows):
            for j in range(cols):
                self.A[i,j] = (self.A[i,j] / WordsPerDoc[j]) * log(float(cols) / DocsPerWord[i])

mylsa = LSA(stopwords, ignorechars)
for t in titles:
    mylsa.parse(t)
mylsa.build()
a = mylsa.calc()
cosine_similarity(a)

Following @ogrisel's answer:

I ran the following code, but I am still astonished :) Plain TF-IDF gives at most 80% similarity between two documents on the same subject, while this code gives me 99.99%. That's why I think something is wrong :P

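from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import Normalizer

dataset = [doc1, doc2, doc3]
vectorizer = TfidfVectorizer(max_df=0.5, stop_words='english')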
X = vectorizer.fit_transform(dataset)
lsa = TruncatedSVD()
X = lsa.fit_transform(X)
X = Normalizer(copy=False).fit_transform(X)

cosine_similarity(X)
Tasos asked Sep 25 '13 06:09



1 Answer

You can use the TruncatedSVD transformer from sklearn 0.14+. Call fit_transform on your database of documents, then call the transform method of the same TruncatedSVD instance on the query documents. You can then compute the cosine similarity between the transformed query documents and the transformed database with sklearn.metrics.pairwise.cosine_similarity, and apply numpy.argsort to the result to find the index of the most similar document.
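The steps above could look roughly like the following. This is only a minimal sketch: the toy corpus, the query and n_components=2 are made-up assumptions for illustration, not values from the question.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus standing in for the real document database.
database = [
    "the cat sat on the mat",
    "a cat and a kitten played on the mat",
    "stock markets fell sharply on monday",
    "investors sold stock as markets dropped",
]
query = ["the kitten sat on the mat"]

# Build TF-IDF vectors; the vocabulary is learned from the database only.
vectorizer = TfidfVectorizer(stop_words="english")
X_db = vectorizer.fit_transform(database)

# Fit the LSA projection on the database (n_components=2 is an assumption
# that suits this tiny corpus; real corpora usually need many more).
lsa = TruncatedSVD(n_components=2, random_state=0)
db_lsa = lsa.fit_transform(X_db)

# Project the query with the SAME fitted vectorizer and TruncatedSVD instance.
query_lsa = lsa.transform(vectorizer.transform(query))

# One row per query, one column per database document.
similarities = cosine_similarity(query_lsa, db_lsa)[0]

# Sort database indices from most to least similar.
ranking = np.argsort(similarities)[::-1]
best = ranking[0]
print("most similar document:", database[best])
```

Reusing the fitted vectorizer and the fitted TruncatedSVD for the query is what keeps the query and the database in the same latent space.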

Note that under the hood, scikit-learn also uses NumPy but in a more efficient way than the snippet you gave (by using the Randomized SVD trick by Halko, Martinsson and Tropp).
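As a rough illustration of that note, the sketch below compares an exact SVD with sklearn.utils.extmath.randomized_svd, the helper behind TruncatedSVD's default algorithm="randomized". The random low-rank matrix and its sizes are assumptions chosen only to make the comparison easy to see.

```python
import numpy as np
from scipy.linalg import svd
from sklearn.utils.extmath import randomized_svd

rng = np.random.RandomState(0)

# Hypothetical rank-5 matrix standing in for a term-document matrix.
A = rng.rand(200, 5) @ rng.rand(5, 100)

# Exact SVD: computes every singular value of A.
_, s_exact, _ = svd(A, full_matrices=False)

# Randomized SVD: approximates only the top n_components singular triplets.
U, s_approx, Vt = randomized_svd(A, n_components=5, random_state=0)

# Compare the leading singular values from both methods.
print(s_exact[:5])
print(s_approx)
```

The randomized variant only ever works with the requested number of components (plus a small oversampling margin), which is where the saving comes from on large sparse matrices.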

ogrisel answered Sep 20 '22 01:09
