Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do I approximate "Did you mean?" without using Google?

Tags:

I am aware of the duplicates of this question:

  • How does the Google “Did you mean?” Algorithm work?
  • How do you implement a “Did you mean”?
  • ... and many others.

These questions are interested in how the algorithm actually works. My question is more like: Let's assume Google did not exist or maybe this feature did not exist and we don't have user input. How does one go about implementing an approximate version of this algorithm?

Why is this interesting?

Ok. Try typing "qualfy" into Google and it tells you:

Did you mean: qualify

Fair enough. It uses Statistical Machine Learning on data collected from billions of users to do this. But now try typing this: "Trytoreconnectyou" into Google and it tells you:

Did you mean: Try To Reconnect You

Now this is the more interesting part. How does Google determine this? Have a dictionary handy and guess the most probably words again using user input? And how does it differentiate between a misspelled word and a sentence?

Now considering that most programmers do not have access to input from billions of users, I am looking for the best approximate way to implement this algorithm and what resources are available (datasets, libraries etc.). Any suggestions?

like image 430
Legend Avatar asked Mar 10 '11 20:03

Legend


People also ask

Did you mean by Google?

Google's search engine includes a feature now familiar to many web users - "Did you mean" - which provides alternative suggestions when you may have misspelled a search term.

How does Google's Did you mean work?

According to Google, the search feature follows a pre-determined process : a query is initiated, the web is navigated by following links from one page to another (an operation termed web crawling), applicable pages are sorted using a set of criteria and indexes, and the most suitable results delivered which are ...

Did you mean searching algorithm?

The Did You Mean algorithm works separately on every term within the query. If a search query returns less than 50 results, the Did You Mean algorithm performs the following: Primo searches for a match in the Did You Mean index. Several candidates are checked and the highest-ranking result is used.

How does Google detect typo?

Google figures out possible misspellings and their likely correct spellings by using words it finds while searching the web and processing user queries.

What is Did you mean in Google search?

Google's search engine includes a feature now familiar to many web users - "Did you mean" - which provides alternative suggestions when you may have misspelled a search term. Text search is common in a variety of applications including many eCommerce web sites where it is typically used to allow customers to search the product catalogue ...

How does Google’s algorithm work?

The algorithm that Google uses to guess the desired query by a user who mistypes the search terms functions quite well. Let’s suppose the user searches for a word on Google but inadvertently misspells it: In that case, the search doesn’t show the literal query typed by the user.

How does Google know who corrects the word'Nigth'?

Also this means if overnight everyone start to spell night as "nigth" google would suggest that word instead. @ThomasRutter: Douglas describe it as "statistical machine learning". They know who correct the query, because they know which query comes from which user ( using cookies )

How to write a Google review without a Gmail account?

2 Simple Steps to Write a Google Review Without a Gmail Account. 1 Step 1. Search for the business you wish to review. Begin by Googling the business you wish to review. 2 Step 2. Write A Review. 3 Final Thoughts on Google Reviews.


2 Answers

Assuming you have a dictionary of words (all the words that appear in the dictionary in the worst case, all the phrases that appear in the data in your system in the best case) and that you know the relative frequency of the various words, you should be able to reasonably guess at what the user meant via some combination of the similarity of the word and the number of hits for the similar word. The weights obviously require a bit of trial and error, but generally the user will be more interested in a popular result that is a bit linguistically further away from the string they entered than in a valid word that is linguistically closer but only has one or two hits in your system.

The second case should be a bit more straightforward. You find all the valid words that begin the string ("T" is invalid, "Tr" is invalid, "Try" is a word, "Tryt" is not a word, etc.) and for each valid word, you repeat the algorithm for the remaining string. This should be pretty quick assuming your dictionary is indexed. If you find a result where you are able to decompose the long string into a set of valid words with no remaining characters, that's what you recommend. Of course, if you're Google, you probably modify the algorithm to look for substrings that are reasonably close typos to actual words and you have some logic to handle cases where a string can be read multiple ways with a loose enough spellcheck (possibly using the number of results to break the tie).

like image 118
Justin Cave Avatar answered Oct 21 '22 18:10

Justin Cave


From the horse's mouth: How to Write a Spelling Corrector

The interesting thing here is how you don't need a bunch of query logs to approximate the algorithm. You can use a corpus of mostly-correct text (like a bunch of books from Project Gutenberg).

like image 23
Adrian McCarthy Avatar answered Oct 21 '22 17:10

Adrian McCarthy