Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do you implement a "Did you mean"? [duplicate]

Tags:

nlp

People also ask

How does the Google Did you mean algorithm work?

According to Google, the search feature follows a pre-determined process : a query is initiated, the web is navigated by following links from one page to another (an operation termed web crawling), applicable pages are sorted using a set of criteria and indexes, and the most suitable results delivered which are ...

Did you mean in Python?

didYouMean is Python function that you can use to correct spelling mistakes that users make, and the words arent available in any of the Dictionaries. didYouMean makes use of Google's "Did You Mean" feature.

Did you mean searching algorithm?

The Did You Mean algorithm works separately on every term within the query. If a search query returns less than 50 results, the Did You Mean algorithm performs the following: Primo searches for a match in the Did You Mean index. Several candidates are checked and the highest-ranking result is used.


Actually what Google does is very much non-trivial and also at first counter-intuitive. They don't do anything like check against a dictionary, but rather they make use of statistics to identify "similar" queries that returned more results than your query, the exact algorithm is of course not known.

There are different sub-problems to solve here, as a fundamental basis for all Natural Language Processing statistics related there is one must have book: Foundation of Statistical Natural Language Processing.

Concretely to solve the problem of word/query similarity I have had good results with using Edit Distance, a mathematical measure of string similarity that works surprisingly well. I used to use Levenshtein but the others may be worth looking into.

Soundex - in my experience - is crap.

Actually efficiently storing and searching a large dictionary of misspelled words and having sub second retrieval is again non-trivial, your best bet is to make use of existing full text indexing and retrieval engines (i.e. not your database's one), of which Lucene is currently one of the best and coincidentally ported to many many platforms.


Google's Dr Norvig has outlined how it works; he even gives a 20ish line Python implementation:

http://googlesystem.blogspot.com/2007/04/simplified-version-of-googles-spell.html

http://www.norvig.com/spell-correct.html

Dr Norvig also discusses the "did you mean" in this excellent talk. Dr Norvig is head of research at Google - when asked how "did you mean" is implemented, his answer is authoritive.

So its spell-checking, presumably with a dynamic dictionary build from other searches or even actual internet phrases and such. But that's still spell checking.

SOUNDEX and other guesses don't get a look in, people!


Check this article on wikipedia about the Levenshtein distance. Make sure you take a good look at Possible improvements.


I was pleasantly surprised that someone has asked how to create a state-of-the-art spelling suggestion system for search engines. I have been working on this subject for more than a year for a search engine company and I can point to information on the public domain on the subject.

As was mentioned in a previous post, Google (and Microsoft and Yahoo!) do not use any predefined dictionary nor do they employ hordes of linguists that ponder over the possible misspellings of queries. That would be impossible due to the scale of the problem but also because it is not clear that people could actually correctly identify when and if a query is misspelled.

Instead there is a simple and rather effective principle that is also valid for all European languages. Get all the unique queries on your search logs, calculate the edit distance between all pairs of queries, assuming that the reference query is the one that has the highest count.

This simple algorithm will work great for many types of queries. If you want to take it to the next level then I suggest you read the paper by Microsoft Research on that subject. You can find it here

The paper has a great introduction but after that you will need to be knowledgeable with concepts such as the Hidden Markov Model.


I would suggest looking at SOUNDEX to find similar words in your database.

You can also access google own dictionary by using the Google API spelling suggestion request.


You may want to look at Peter Norvig's "How to Write a Spelling Corrector" article.


I believe Google logs all queries and identifies when someone makes a spelling correction. This correction may then be suggested when others supply the same first query. This will work for any language, in fact any string of any characters.