When entering a question, stackoverflow presents you with a list of questions that it thinks likely to cover the same topic. I have seen similar features on other sites or in other programs, too (Help file systems, for example), but I've never programmed something like this myself. Now I'm curious to know what sort of algorithm one would use for that.
The first approach that comes to my mind is splitting the phrase into words and look for phrases containing these words. Before you do that, you probably want to throw away insignificant words (like 'the', 'a', 'does' etc), and then you will want to rank the results.
Hey, wait - let's do that for web pages, and then we can have a ... watchamacallit ... - a "search engine", and then we can sell ads, and then ...
No, seriously, what are the common ways to solve this problem?
like, similar to, also, unlike, similarly, in the same way, likewise, again, compared to, in contrast, in like manner, contrasted with, on the contrary, however, although, yet, even though, still, but, nevertheless, conversely, at the same time, regardless, despite, while, on the one hand … on the other hand.
in comparison phrase. used for talking about the ways in which two people or things are different.
One approach is the so called bag-of-words model.
As you guessed, first you count how many times words appear in the text (usually called document in the NLP-lingo). Then you throw out the so called stop words, such as "the", "a", "or" and so on.
You're left with words and word counts. Do this for a while and you get a comprehensive set of words that appear in your documents. You can then create an index for these words: "aardvark" is 1, "apple" is 2, ..., "z-index" is 70092.
Now you can take your word bags and turn them into vectors. For example, if your document contains two references for aardvarks and nothing else, it would look like this:
[2 0 0 ... 70k zeroes ... 0].
After this you can count the "angle" between the two vectors with a dot product. The smaller the angle, the closer the documents are.
This is a simple version and there other more advanced techniques. May the Wikipedia be with you.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With