I need to compare documents stored in a DB and come up with a similarity score between 0 and 1.
The method has to be very simple: a vanilla implementation of n-grams (where it is possible to define how many grams to use), along with a simple implementation of tf-idf and cosine similarity.
Is there any program that can do this? Or should I start writing this from scratch?
We use the formula below to compute the cosine similarity, where A and B are vectors:

    similarity = A.B / (||A|| * ||B||)

A.B is the dot product of A and B: it is computed as the sum of the element-wise products of A and B. ||A|| is the L2 norm of A: it is computed as the square root of the sum of the squares of the elements of A.
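For example, with two small made-up vectors the computation looks like this (a quick NumPy sketch; the numbers are just for illustration):

from numpy import array, dot
from numpy.linalg import norm

A = array([1.0, 2.0, 0.0])
B = array([2.0, 1.0, 1.0])

# A.B = 1*2 + 2*1 + 0*1 = 4
# ||A|| = sqrt(1 + 4 + 0), ||B|| = sqrt(4 + 1 + 1)
similarity = dot(A, B) / (norm(A) * norm(B))
print(similarity)  # 4 / (sqrt(5) * sqrt(6)) ~ 0.73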
Tf-idf is a transformation you apply to texts to get real-valued vectors, one per document. You can then obtain the cosine similarity of any pair of vectors by taking their dot product and dividing that by the product of their norms. That yields the cosine of the angle between the vectors.
TF-IDF will give you a representation for a given term in a document. Cosine similarity will give you a score for two different documents that share the same representation. However, "one of the simplest ranking functions is computed by summing the tf–idf for each query term".
Wikipedia defines an N-Gram as “A contiguous sequence of N items from a given sample of text or speech”. Here an item can be a character, a word or a sentence and N can be any integer. When N is 2, we call the sequence a bigram. Similarly, a sequence of 3 items is called a trigram, and so on.
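To make that concrete, here is a tiny sketch of word-level n-grams with a configurable n (splitting on whitespace is just an assumption for the example):

def word_ngrams(text, n):
    # split on whitespace and slide a window of length n over the tokens
    tokens = text.split()
    return [tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]

print(word_ngrams("the quick brown fox jumps", 2))
# [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox'), ('fox', 'jumps')]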
Check out the NLTK package: http://www.nltk.org. It has everything you need.
For the cosine_similarity:
import math
import numpy

def cosine_distance(u, v):
    """
    Returns the cosine of the angle between vectors v and u. This is equal to
    u.v / |u||v|.
    """
    return numpy.dot(u, v) / (math.sqrt(numpy.dot(u, u)) * math.sqrt(numpy.dot(v, v)))
For ngrams:
from itertools import chain

def ngrams(sequence, n, pad_left=False, pad_right=False, pad_symbol=None):
    """
    A utility that produces a sequence of ngrams from a sequence of items.
    For example:

    >>> ngrams([1,2,3,4,5], 3)
    [(1, 2, 3), (2, 3, 4), (3, 4, 5)]

    Use ingram for an iterator version of this function. Set pad_left
    or pad_right to true in order to get additional ngrams:

    >>> ngrams([1,2,3,4,5], 2, pad_right=True)
    [(1, 2), (2, 3), (3, 4), (4, 5), (5, None)]

    @param sequence: the source data to be converted into ngrams
    @type sequence: C{sequence} or C{iterator}
    @param n: the degree of the ngrams
    @type n: C{int}
    @param pad_left: whether the ngrams should be left-padded
    @type pad_left: C{boolean}
    @param pad_right: whether the ngrams should be right-padded
    @type pad_right: C{boolean}
    @param pad_symbol: the symbol to use for padding (default is None)
    @type pad_symbol: C{any}
    @return: The ngrams
    @rtype: C{list} of C{tuple}s
    """
    if pad_left:
        sequence = chain((pad_symbol,) * (n-1), sequence)
    if pad_right:
        sequence = chain(sequence, (pad_symbol,) * (n-1))
    sequence = list(sequence)

    count = max(0, len(sequence) - n + 1)
    return [tuple(sequence[i:i+n]) for i in range(count)]
For tf-idf you will have to compute the frequency distribution first. I am using Lucene to do that, but you may very well do something similar with NLTK; use FreqDist:
http://nltk.googlecode.com/svn/trunk/doc/book/ch01.html#frequency_distribution_index_term
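For example, a rough sketch of using FreqDist to get term frequencies and document frequencies (the two toy documents and the raw-count * log(N/df) weighting scheme are just assumptions for illustration):

from math import log
from nltk import FreqDist

docs = [
    "the quick brown fox".split(),
    "the lazy dog sleeps".split(),
]

# term frequency: one FreqDist (essentially a counter) per document
tfs = [FreqDist(tokens) for tokens in docs]

# document frequency: how many documents each term appears in
df = FreqDist(term for tokens in docs for term in set(tokens))

N = len(docs)
tfidf = [{t: f * log(float(N) / df[t]) for t, f in fd.items()} for fd in tfs]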
If you like PyLucene, this will tell you how to compute tf-idf:
# reader = lucene.IndexReader(FSDirectory.open(index_loc))
docs = reader.numDocs()
for i in xrange(docs):
    tfv = reader.getTermFreqVector(i, fieldname)
    if tfv:
        rec = {}
        terms = tfv.getTerms()
        frequencies = tfv.getTermFrequencies()
        for (t, f, x) in zip(terms, frequencies, xrange(maxtokensperdoc)):
            df = searcher.docFreq(Term(fieldname, t))  # number of docs with the given term
            tmap.setdefault(t, len(tmap))
            rec[t] = sim.tf(f) * sim.idf(df, max_doc)  # compute TF.IDF
        # and normalize the values using cosine normalization
        if cosine_normalization:
            denom = sum([x**2 for x in rec.values()]) ** 0.5
            for k, v in rec.items():
                rec[k] = v / denom
If you are interested, I've done a tutorial series (Part I and Part II) on tf-idf using the Scikits.learn (sklearn) Python module.
Part 3 has cosine similarity.
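For completeness, the whole pipeline the question asks for (configurable n-grams, tf-idf, cosine similarity scores between 0 and 1) can be sketched in a few lines with sklearn; the documents below are placeholders and ngram_range=(1, 2) is just an example setting:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "text of the first document",
    "text of the second document",
    "something else entirely",
]

# word n-grams; change ngram_range to control how many grams are used
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
tfidf_matrix = vectorizer.fit_transform(documents)

# pairwise cosine similarities between all documents, each in [0, 1]
scores = cosine_similarity(tfidf_matrix)
print(scores)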