I have the following requirement: -
I have many (say 1 million) values (names). The user will type a search string.
I don't expect the user to spell the names correctly.
So, I want to make kind of Google "Did you mean". This will list all the possible values from my datastore. There is a similar but not same question here. This did not answer my question.
My question: - 1) I think it is not advisable to store those data in RDBMS. Because then I won't have filter on the SQL queries. And I have to do full table scan. So, in this situation how the data should be stored?
2) The second question is the same as this. But, just for the completeness of my question: how do I search through the large data set? Suppose, there is a name Franky in the dataset. If a user types as Phranky, how do I match the Franky? Do I have to loop through all the names?
I came across Levenshtein Distance, which will be a good technique to find the possible strings. But again, my question is do I have to operate on all 1 million values from my data store?
3) I know, Google does it by watching users behavior. But I want to do it without watching user behavior, i.e. by using, I don't know yet, say distance algorithms. Because the former method will require large volume of searches to start with!
4) As Kirk Broadhurst pointed out in an answer below, there are two possible scenarios: -
I am interested in both of these. They are really two separate things; e.g. Sean and Shawn sound the same but have an edit distance of 3 - too high to be considered a typo.
According to Google, the search feature follows a pre-determined process : a query is initiated, the web is navigated by following links from one page to another (an operation termed web crawling), applicable pages are sorted using a set of criteria and indexes, and the most suitable results delivered which are ...
Autocomplete is a feature within Google Search that makes it faster to complete searches that you start to type. Our automated systems generate predictions that help people save time by allowing them to quickly complete the search they already intended to do.
The Did You Mean algorithm works separately on every term within the query. If a search query returns less than 50 results, the Did You Mean algorithm performs the following: Primo searches for a match in the Did You Mean index. Several candidates are checked and the highest-ranking result is used.
Google figures out possible misspellings and their likely correct spellings by using words it finds while searching the web and processing user queries.
The Soundex algorithm may help you out with this.
http://en.wikipedia.org/wiki/Soundex
You could pre-generate the soundex values for each name and store it in the database, then index that to avoid having to scan the table.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With