Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

PHP - How to suggest terms for search, "did you mean...?"

Tags:

php

search

When searching the db with terms that retrieve no results I want to allow "did you mean..." suggestion (like Google). So for example if someone looks for "jquyer" ", it would output "did you mean jquery?"

Of course, suggestion results have to be matched against the values inside the db (i'm using mysql).

Do you know a library that can do this? I've googled this but haven't found any great results. Or perhaps you have an idea how to construct this on my own?

like image 413
Gal Avatar asked Dec 11 '09 14:12

Gal


2 Answers

When I did this a couple of years ago, I already had a custom built index of words that the search engine used. I studied what kinds of errors people made the most (based on logs) and sorted the suggestions based on how common the mistake was.

If someone searched for jQuery, I would build a select-statement that went

SELECT Word, 1 AS Relevance 
FROM keywords 
WHERE Word IN ('qjuery','juqery','jqeury' etc)  

UNION 

SELECT Word, 2 AS Relevance 
FROM keywords 
WHERE Word LIKE 'j_query' OR Word LIKE 'jq_uery' etc etc 
ORDER BY Relevance, Word  

The resulting words were my suggestions and it worked really well.

like image 82
Oliver John Avatar answered Sep 21 '22 10:09

Oliver John


A quick and easy solution involves SOUNDEX or SOUNDEX-like functions.

In a nutshell the SOUNDEX function was originally used to deal with common typos and alternate spellings for family names, and this function, encapsulates very well many common spelling mistakes (in the english language). Because of its focus on family names, the original soundex function may be limiting (for example encoding stops after the third or fourth non-repeating consonant letter), but it is easy to expend the algorithm.

The interest of this type of function is that it allows computing, ahead of time, a single value which can be associated with the word. This is unlike string distance functions such as edit distance functions (such as Levenshtein, Hamming or even Ratcliff/Obershelp) which provide a value relative to a pair of strings.

By pre-computing and indexing the SOUNDEX value for all words in the dictionary, one can, at run-time, quickly search the dictionary/database based on the [run-time] calculated SOUNDEX value of the user-supplied search terms. This Soundex search can be done systematically, as complement to the plain keyword search, or only performed when the keyword search didn't yield a satisfactory number of records, hence providing the hint that maybe the user-supplied keyword(s) is (are) misspelled.


A totally different approach, only applicable on user queries which include several words, is based on running multiple queries against the dictionary/database, excluding one (or several) of the user-supplied keywords. These alternate queries' result lists provide a list of distinct words; This [reduced] list of words is typically small enough that pair-based distance functions can be applied to select, within the list, the words which are closer to the allegedly misspelled word(s). The word frequency (within the results lists) can be used to both limit the number of words (only evaluate similarity for the words which are found more than x times), as well as to provide weight, to slightly skew the similarity measurements (i.e favoring words found "in quantity" in the database, even if their similarity measurement is slightly less).

like image 36
mjv Avatar answered Sep 21 '22 10:09

mjv