 

Python: Semantic similarity score for Strings [duplicate]

Are there any libraries for computing semantic similarity scores for a pair of sentences ?

I'm aware of WordNet's semantic database, and how I can generate the score for 2 words, but I'm looking for libraries that do all pre-processing tasks like port-stemming, stop word removal, etc, on whole sentences and outputs a score for how related the two sentences are.

I found a work in progress that's written using the .NET framework that computes the score using an array of pre-processing steps. Is there any project that does this in python?

I'm not looking for the sequence of operations that would help me find the score (as is asked for here)
I'd love to implement each stage on my own, or glue functions from different libraries so that it works for sentence pairs, but I need this mostly as a tool to test inferences on data.


EDIT: I was considering using NLTK, computing the score for every pair of words across the two sentences, and then drawing inferences from the standard deviation of the results, but I don't know if that's a legitimate estimate of similarity. Besides, that would take a LOT of time for long strings.
Again, I'm looking for projects/libraries that already implement this intelligently. Something that lets me do this:

```python
import amazing_semsim_package

str1 = 'Birthday party ruined as cake explodes'
str2 = 'Grandma mistakenly bakes cake using gunpowder'

>>> similarity(str1, str2)
0.889
```
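Absent such a package, the pairwise idea from the edit above can be roughed out with the standard library alone. This is only a sketch: `difflib.SequenceMatcher` is a stand-in for a real semantic word-to-word score (e.g. a WordNet measure), and it aggregates by taking each word's best match rather than a standard deviation, which tends to be a more robust summary:

```python
from difflib import SequenceMatcher

def word_sim(w1, w2):
    # Stand-in for a semantic word-to-word score; swap in a
    # WordNet-based measure for genuinely semantic results.
    return SequenceMatcher(None, w1, w2).ratio()

def sentence_sim(s1, s2):
    # For each word in one sentence, take its best match in the other,
    # average those scores, and symmetrize over both directions.
    words1, words2 = s1.lower().split(), s2.lower().split()
    def one_way(a, b):
        return sum(max(word_sim(w, v) for v in b) for w in a) / len(a)
    return (one_way(words1, words2) + one_way(words2, words1)) / 2

str1 = 'Birthday party ruined as cake explodes'
str2 = 'Grandma mistakenly bakes cake using gunpowder'
print(sentence_sim(str1, str2))
```

With a surface-level word score this mostly rewards shared vocabulary ("cake" in both sentences); the aggregation structure, however, carries over unchanged once a semantic word measure is plugged in.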
asked Jun 10 '13 by user8472

People also ask

How do you check if two strings are similar in Python?

The simplest way to check if two strings are equal in Python is to use the == operator. And if you are looking for the opposite, then != is what you need. That's it!
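For reference, plain equality checking (no similarity scoring at all) looks like this:

```python
a = "cake"
b = "cake"

print(a == b)        # exact, case-sensitive comparison
print(a != "Cake")   # differs in case, so the strings are not equal
```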

How do you find the similarity of two strings?

The way to check the similarity between data points or groups is by calculating the distance between them. For textual data, likewise, we check the similarity between strings by calculating the distance between one text and another.
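A minimal distance-based example using only the standard library; note that this measures surface-level (character) similarity, not semantic similarity:

```python
from difflib import SequenceMatcher

# Ratio of matching characters between the two strings, in [0, 1].
score = SequenceMatcher(None, "cake explodes", "cake exploded").ratio()
print(score)
```

The two strings differ by a single trailing character, so the score is close to 1 even though "explodes" and "exploded" could mean quite different things in context, which is exactly the gap semantic measures try to close.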

How do you measure semantic similarity between words?

To calculate the semantic similarity between words and sentences, the proposed method follows an edge-based approach using a lexical database. The methodology can be applied in a variety of domains, and has been tested on both benchmark standards and mean human similarity datasets.


1 Answer

The best package I've seen for this is Gensim, found at the Gensim homepage. I've used it many times, and overall been very happy with its ease of use; it is written in Python, and has an easy-to-follow tutorial to get you started, which compares nine strings. It can be installed via pip, so installing it shouldn't be much hassle.

Which scoring algorithm you use depends heavily on the context of your problem, but I'd suggest starting off with the LSI functionality if you want something basic. (That's what the tutorial walks you through.)

If you go through the gensim tutorial, it will walk you through comparing two strings using the similarities module. This will let you see how your strings compare to each other, or to some other string, based on the text they contain.

If you're interested in the science behind how it works, check out this paper.

answered Oct 12 '22 by Justin Muller