Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do I compare phrases for similarity?

When entering a question, stackoverflow presents you with a list of questions that it thinks likely to cover the same topic. I have seen similar features on other sites or in other programs, too (Help file systems, for example), but I've never programmed something like this myself. Now I'm curious to know what sort of algorithm one would use for that.

The first approach that comes to my mind is splitting the phrase into words and look for phrases containing these words. Before you do that, you probably want to throw away insignificant words (like 'the', 'a', 'does' etc), and then you will want to rank the results.

Hey, wait - let's do that for web pages, and then we can have a ... watchamacallit ... - a "search engine", and then we can sell ads, and then ...

No, seriously, what are the common ways to solve this problem?

like image 614
Hanno Fietz Avatar asked Sep 16 '08 09:09

Hanno Fietz


People also ask

How do you compare phrases?

like, similar to, also, unlike, similarly, in the same way, likewise, again, compared to, in contrast, in like manner, contrasted with, on the contrary, however, although, yet, even though, still, but, nevertheless, conversely, at the same time, regardless, despite, while, on the one hand … on the other hand.

What is a Comparing phrase?

in comparison phrase. used for talking about the ways in which two people or things are different.


1 Answers

One approach is the so called bag-of-words model.

As you guessed, first you count how many times words appear in the text (usually called document in the NLP-lingo). Then you throw out the so called stop words, such as "the", "a", "or" and so on.

You're left with words and word counts. Do this for a while and you get a comprehensive set of words that appear in your documents. You can then create an index for these words: "aardvark" is 1, "apple" is 2, ..., "z-index" is 70092.

Now you can take your word bags and turn them into vectors. For example, if your document contains two references for aardvarks and nothing else, it would look like this:

[2 0 0 ... 70k zeroes ... 0].

After this you can count the "angle" between the two vectors with a dot product. The smaller the angle, the closer the documents are.

This is a simple version and there other more advanced techniques. May the Wikipedia be with you.

like image 98
Antti Rasinen Avatar answered Oct 11 '22 07:10

Antti Rasinen