
Simple implementation of N-Gram, tf-idf and Cosine similarity in Python

I need to compare documents stored in a DB and come up with a similarity score between 0 and 1.

The method needs to be very simple: a vanilla implementation of n-grams (where it is possible to define how many grams to use), along with simple implementations of tf-idf and cosine similarity.

Is there any program that can do this? Or should I start writing this from scratch?

asked Mar 04 '10 by seanieb

People also ask

How do you implement cosine similarity in Python?

Cosine similarity is computed as cos(θ) = A·B / (||A|| × ||B||), where A and B are vectors. A·B is the dot product of A and B, computed as the sum of the element-wise products of A and B. ||A|| is the L2 norm of A, computed as the square root of the sum of the squares of the elements of the vector A.
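A minimal sketch of that formula in plain Python (the function name and example values are illustrative, not from the question):

    import math

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))     # A.B
        norm_a = math.sqrt(sum(x * x for x in a))  # ||A||
        norm_b = math.sqrt(sum(y * y for y in b))  # ||B||
        return dot / (norm_a * norm_b)

    print(cosine([3, 4], [4, 3]))  # 0.96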

How do you find cosine similarity using TF-IDF?

Tf-idf is a transformation you apply to texts to get two real-valued vectors. You can then obtain the cosine similarity of any pair of vectors by taking their dot product and dividing that by the product of their norms. That yields the cosine of the angle between the vectors.
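As an illustration, a minimal sketch using scikit-learn (mentioned in one of the answers below; the sample documents are made up):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = ["the quick brown fox", "the slow brown dog", "something else entirely"]

    tfidf = TfidfVectorizer().fit_transform(docs)  # one tf-idf vector per document

    # pairwise cosine similarities, values between 0 and 1
    print(cosine_similarity(tfidf))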

What is TF-IDF cosine?

TF-IDF will give you a representation for a given term in a document. Cosine similarity will give you a score for two different documents that share the same representation. However, "one of the simplest ranking functions is computed by summing the tf–idf for each query term".

What is N-gram in TF-IDF?

Wikipedia defines an N-Gram as “A contiguous sequence of N items from a given sample of text or speech”. Here an item can be a character, a word or a sentence and N can be any integer. When N is 2, we call the sequence a bigram. Similarly, a sequence of 3 items is called a trigram, and so on.
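A quick character-level illustration in plain Python (a sketch; the helper name is mine):

    def char_ngrams(text, n):
        # contiguous windows of n characters
        return [text[i:i + n] for i in range(len(text) - n + 1)]

    print(char_ngrams("ngram", 2))  # ['ng', 'gr', 'ra', 'am']  (bigrams)
    print(char_ngrams("ngram", 3))  # ['ngr', 'gra', 'ram']     (trigrams)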


2 Answers

Check out the NLTK package (http://www.nltk.org); it has everything you need.

For cosine similarity:

    import math
    import numpy

    def cosine_distance(u, v):
        """
        Returns the cosine of the angle between vectors v and u. This is equal to
        u.v / |u||v|.
        """
        return numpy.dot(u, v) / (math.sqrt(numpy.dot(u, u)) * math.sqrt(numpy.dot(v, v)))
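Note that despite the name, this returns the cosine similarity (1.0 for vectors pointing the same way), not a distance. For example:

    u, v = numpy.array([3.0, 4.0]), numpy.array([4.0, 3.0])
    print(cosine_distance(u, v))  # 0.96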

For ngrams:

    from itertools import chain

    def ngrams(sequence, n, pad_left=False, pad_right=False, pad_symbol=None):
        """
        A utility that produces a sequence of ngrams from a sequence of items.
        For example:

        >>> ngrams([1,2,3,4,5], 3)
        [(1, 2, 3), (2, 3, 4), (3, 4, 5)]

        Use ingram for an iterator version of this function.  Set pad_left
        or pad_right to true in order to get additional ngrams:

        >>> ngrams([1,2,3,4,5], 2, pad_right=True)
        [(1, 2), (2, 3), (3, 4), (4, 5), (5, None)]

        @param sequence: the source data to be converted into ngrams
        @type sequence: C{sequence} or C{iterator}
        @param n: the degree of the ngrams
        @type n: C{int}
        @param pad_left: whether the ngrams should be left-padded
        @type pad_left: C{boolean}
        @param pad_right: whether the ngrams should be right-padded
        @type pad_right: C{boolean}
        @param pad_symbol: the symbol to use for padding (default is None)
        @type pad_symbol: C{any}
        @return: The ngrams
        @rtype: C{list} of C{tuple}s
        """
        if pad_left:
            sequence = chain((pad_symbol,) * (n-1), sequence)
        if pad_right:
            sequence = chain(sequence, (pad_symbol,) * (n-1))
        sequence = list(sequence)

        count = max(0, len(sequence) - n + 1)
        return [tuple(sequence[i:i+n]) for i in range(count)]
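For document similarity you would typically feed it word (or character) tokens and use the resulting tuples as the "terms" to weight, e.g.:

    >>> ngrams("compare documents stored in a db".split(), 2)
    [('compare', 'documents'), ('documents', 'stored'), ('stored', 'in'), ('in', 'a'), ('a', 'db')]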

For tf-idf you will have to compute the term distribution first. I am using Lucene to do that, but you may very well do something similar with NLTK; use FreqDist:

http://nltk.googlecode.com/svn/trunk/doc/book/ch01.html#frequency_distribution_index_term
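A minimal sketch of that idea with FreqDist (the tokenization and the idf formula here are my own simplifications, not taken from the book):

    import math
    from nltk import FreqDist

    docs = [d.lower().split() for d in
            ["the quick brown fox", "the slow brown dog"]]

    n_docs = len(docs)
    # document frequency: how many documents each term appears in
    df = FreqDist(term for doc in docs for term in set(doc))

    def tf_idf(doc):
        tf = FreqDist(doc)  # raw term counts for one document
        return {t: tf[t] * math.log(n_docs / df[t]) for t in tf}

    print(tf_idf(docs[0]))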

If you like PyLucene, this will tell you how to compute tf-idf:

    # assumes an open PyLucene index: `reader`, `searcher`, a Similarity
    # instance `sim`, and `tmap`, `max_doc`, `maxtokensperdoc`, `fieldname`,
    # `cosine_normalization` defined elsewhere
    # reader = lucene.IndexReader(FSDirectory.open(index_loc))
    docs = reader.numDocs()
    for i in xrange(docs):
        tfv = reader.getTermFreqVector(i, fieldname)
        if tfv:
            rec = {}
            terms = tfv.getTerms()
            frequencies = tfv.getTermFrequencies()
            for (t, f, x) in zip(terms, frequencies, xrange(maxtokensperdoc)):
                df = searcher.docFreq(Term(fieldname, t))  # number of docs with the given term
                tmap.setdefault(t, len(tmap))
                rec[t] = sim.tf(f) * sim.idf(df, max_doc)  # compute tf-idf
            # and normalize the values using cosine normalization
            if cosine_normalization:
                denom = sum([x**2 for x in rec.values()])**0.5
                for k, v in rec.items():
                    rec[k] = v / denom
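To turn two such rec dictionaries into the final 0-1 score, take the cosine of the sparse vectors (a sketch; if you applied the cosine normalization above, this reduces to the plain dot product):

    def sparse_cosine(a, b):
        # dot product over shared terms, divided by both norms
        dot = sum(w * b[t] for t, w in a.items() if t in b)
        norm_a = sum(w * w for w in a.values()) ** 0.5
        norm_b = sum(w * w for w in b.values()) ** 0.5
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0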
answered Oct 02 '22 by roman


If you are interested, I've written a tutorial series (Part I and Part II) about tf-idf using the scikit-learn (sklearn) Python module.

Part III covers cosine similarity.

answered Oct 02 '22 by Tarantula