
Building or Finding a "relevant terms" suggestion feature

Given a few words of input, I want to have a utility that will return a diverse set of relevant terms, phrases, or concepts. A caveat is that it would need to have a large graph of terms to begin with, or else the feature would not be very useful.

For example, submitting "baseball" would return

["shortstop", "Babe Ruth", "foul ball", "steroids", ... ]

Google Sets is the best example I can find of this kind of feature, but I can't use it since they have no public API (and I won't go against their TOS). Also, single-word input doesn't garner a very diverse set of results. I'm looking for a solution that goes off on tangents.

The closest I've experimented with is using Wikipedia's API to search Categories and Backlinks, but there's no way to directly sort those results by "relevance" or "popularity". Without that, the suggestion list is massive and all over the place, which is not immediately useful and very hard to whittle down.

Using a thesaurus could also work minimally, but that would leave out any proper nouns or tangentially relevant terms (like any of the results listed above).


I would happily reuse an open service, if one exists, but I haven't found anything sufficient.

I'm looking either for a way to implement this in-house with a decently populated starting set, or for a free service that offers this.

Have a solution? Thanks ahead of time!


UPDATE: Thank you for the incredibly dense & informative answers. I'll choose a winning answer in 6 to 12 months, when I'll hopefully understand what you've all suggested =)

asked Feb 21 '09 by drfloob


3 Answers

You might be interested in WordNet. It takes a bit of linguistic knowledge to understand the API, but basically the system is a database of meaning-based links between English words, which is more or less what you're searching for. I'm sure I can dig up more information if you want it.

answered Nov 17 '22 by David Z


Peter Norvig (director of research at Google) spoke about how they do this at Google (specifically mentioning Google Sets) in a Facebook Tech Talk. The idea is that a relatively simple algorithm on a huge dataset (e.g. the entire web) is much better than a complicated algorithm on a small data set.

You could look at Google's n-gram collection as a starting point. You'd start to see what concepts are grouped together. Norvig hinted that internally Google has up to 7-grams for use in things like Google Translate.

If you're more ambitious, you could download all of Wikipedia's articles in the language you desire and create your own n-gram database.
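To make the n-gram idea concrete, here is a minimal sketch of building your own n-gram counts from a corpus with nothing but the Python standard library. The function names and the toy corpus are my own illustration, standing in for a real Wikipedia dump:

```python
from collections import Counter

def ngrams(tokens, n):
    """Yield every contiguous n-token window as a tuple."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])

def build_ngram_counts(documents, n=2):
    """Count n-grams across a corpus of plain-text documents."""
    counts = Counter()
    for doc in documents:
        counts.update(ngrams(doc.lower().split(), n))
    return counts

# Toy corpus standing in for downloaded Wikipedia articles
docs = [
    "Babe Ruth played baseball",
    "baseball players hit a foul ball",
    "a foul ball is part of baseball",
]
counts = build_ngram_counts(docs, n=2)
print(counts[("foul", "ball")])  # → 2
```

High-count n-grams containing your seed word ("baseball" → "foul ball") are exactly the grouped concepts Norvig was describing; at web scale you'd stream the corpus and prune rare n-grams rather than hold everything in memory.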

The problem is even more complicated if you just have a single word; check out this recent thesis for more details on word sense disambiguation.

It's not an easy problem, but it is useful as you mentioned. In the end, I think you'll find that a really successful implementation will have a relatively simple algorithm and a whole lot of data.

answered Nov 17 '22 by Jeff Moser


Take a look at the following two papers:

  • Clustering User Queries of a Search Engine [pdf]
  • Topic Detection by Clustering Keywords [pdf]

Here is my attempt at a very simplified explanation:

If we have a database of past user queries, we can define a similarity function between two queries. For example: number of words in common. Now for each query in our database, we compute its similarity with each other query, and remember the k most similar queries. The non-overlapping words from these can be returned as "related terms".
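A minimal sketch of that idea, using shared-word count as the similarity function (the function name and the tiny query log are my own illustration):

```python
def related_terms(query, past_queries, k=2):
    """Suggest words from the k past queries sharing the most words with `query`."""
    q_words = set(query.lower().split())
    # Rank past queries by word overlap with the input (sorted is stable,
    # so ties keep their original log order)
    scored = sorted(
        past_queries,
        key=lambda p: len(q_words & set(p.lower().split())),
        reverse=True,
    )
    suggestions = []
    for past in scored[:k]:
        for word in past.lower().split():
            if word not in q_words and word not in suggestions:
                suggestions.append(word)
    return suggestions

log = ["baseball scores", "baseball steroids scandal", "basketball scores"]
print(related_terms("baseball news", log, k=2))
# → ['scores', 'steroids', 'scandal']
```

A real system would use a better similarity measure (e.g. TF-IDF weighting) so that common words don't dominate the overlap.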

We can also take this approach with a database of documents containing information users might be searching for. We can define the similarity between two search terms as the number of documents containing both divided by the number of documents containing either. To decide which terms to test, we can scan the documents and throw out words that are either too common ('and', 'the', etc.) or too obscure.
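That ratio (documents containing both over documents containing either) is the Jaccard similarity of the two terms' document sets. A minimal sketch, with my own illustrative function name and toy documents:

```python
def doc_similarity(term_a, term_b, documents):
    """Docs containing both terms divided by docs containing either (Jaccard)."""
    has_a = {i for i, doc in enumerate(documents) if term_a in doc}
    has_b = {i for i, doc in enumerate(documents) if term_b in doc}
    both, either = has_a & has_b, has_a | has_b
    return len(both) / len(either) if either else 0.0

# Each "document" is represented as the set of words it contains
docs = [
    {"babe", "ruth", "hit", "many", "home", "runs"},
    {"the", "shortstop", "caught", "the", "foul", "ball"},
    {"a", "foul", "ball", "is", "out", "of", "play"},
]
print(doc_similarity("foul", "ball", docs))  # → 1.0
print(doc_similarity("ruth", "foul", docs))  # → 0.0
```

Terms with a high score against the input word become the suggestions; the stop-word and rare-word filtering mentioned above keeps the number of pairs to test manageable.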

If our data permits, then instead of comparing documents by content we could see which results users clicked on after each query. For example, if we had data showing that users searching for "Celtics" and "Lakers" both ended up clicking on espn.com, then we could call these related terms.
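A minimal sketch of that click-based approach, treating two queries as related when they share at least one clicked URL (the function name and the toy click log are my own illustration):

```python
from collections import defaultdict

def related_by_clicks(query, click_log):
    """Find queries whose users clicked at least one URL in common with `query`."""
    urls_by_query = defaultdict(set)
    for q, url in click_log:
        urls_by_query[q].add(url)
    target = urls_by_query[query]
    return sorted(
        q for q, urls in urls_by_query.items()
        if q != query and urls & target
    )

log = [
    ("Celtics", "espn.com"),
    ("Lakers", "espn.com"),
    ("Lakers", "nba.com"),
    ("lasagna recipe", "allrecipes.com"),
]
print(related_by_clicks("Celtics", log))  # → ['Lakers']
```

On real data you'd also weight by click counts, since a single shared click is weak evidence of relatedness.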

If you're starting from scratch with no data about past user queries, then you can try Wikipedia, or the Bag of Words dataset, as a database of documents. If you are looking for a database of user search terms and results, and if you are feeling adventurous, then you can take a look at the AOL Search Data.

answered Nov 17 '22 by Imran