How do I query for records ordered by similarity?
Eg. searching for "Stock Overflow" would return
Eg. searching for "LO" would return:
Using a search engine to index & search a MySQL table, for better results
Using the Sphinx search engine, with PHP
Using the Lucene engine with PHP
Using full-text indexing, to find similar/containing strings
LIKE returns better results, but returns nothing for long queries, although similar strings do exist.

I have found that the Levenshtein distance may be good when you are comparing a full string against another full string, but when you are looking for keywords within a string, this method sometimes does not return the wanted results. Moreover, the SOUNDEX function is not suitable for languages other than English, so it is quite limited. You could get away with LIKE, but it's really only for basic searches. You may want to look into other search methods for what you want to achieve. For example:
You may use Lucene as the search base for your projects. It's implemented in most major programming languages and it's quite fast and versatile. This method is probably the best, as it not only searches for substrings, but also handles letter transposition, prefixes and suffixes (all combined). However, you need to keep a separate index (using cron to update it from an independent script once in a while works, though).
Or, if you want a MySQL solution, the fulltext functionality is pretty good, and certainly faster than a stored procedure. If your tables are not MyISAM (InnoDB only gained FULLTEXT support in MySQL 5.6), you can create a temporary table, then perform your fulltext search:
    CREATE TABLE IF NOT EXISTS `tests`.`data_table` (
      `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
      `title` varchar(2000) CHARACTER SET latin1 NOT NULL,
      `description` text CHARACTER SET latin1 NOT NULL,
      PRIMARY KEY (`id`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin AUTO_INCREMENT=1;
Use a data generator to generate some random data if you don't want to bother creating it yourself...
** NOTE **: the column collation should be latin1_bin to perform a case sensitive search instead of a case insensitive one with latin1. For unicode strings, I would recommend utf8_bin for case sensitive and utf8_general_ci for case insensitive searches.
    DROP TABLE IF EXISTS `tests`.`data_table_temp`;

    CREATE TEMPORARY TABLE `tests`.`data_table_temp`
      SELECT * FROM `tests`.`data_table`;

    ALTER TABLE `tests`.`data_table_temp` ENGINE = MYISAM;

    ALTER TABLE `tests`.`data_table_temp`
      ADD FULLTEXT `FTK_title_description` (`title`, `description`);

    SELECT *,
           MATCH (`title`,`description`)
           AGAINST ('+so* +nullam lorem' IN BOOLEAN MODE) AS `score`
      FROM `tests`.`data_table_temp`
     WHERE MATCH (`title`,`description`)
           AGAINST ('+so* +nullam lorem' IN BOOLEAN MODE)
     ORDER BY `score` DESC;

    DROP TABLE `tests`.`data_table_temp`;
Read more about it from the MySQL API reference page
The downside to this is that it will not look for letter transposition or "similar, sounds like" words.
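To make that downside concrete, here is a small Python sketch (not MySQL) that contrasts whole-word matching with a fuzzy lookup. The word list and the misspelled query are invented for illustration, and `difflib`'s similarity ratio is only a stand-in from the standard library; it is not what MySQL or Lucene uses internally.

```python
import difflib

# Hypothetical indexed words, standing in for the tokens of a full-text index.
indexed_words = ["nullam", "lorem", "ipsum", "dolor"]

# "nullam" typed with two letters transposed.
query = "nulalm"

# Whole-word lookup, roughly how word-based matching behaves: no hit.
exact_hits = [word for word in indexed_words if word == query]

# Fuzzy lookup by similarity ratio: the transposed word is still found.
fuzzy_hits = difflib.get_close_matches(query, indexed_words, n=3, cutoff=0.6)

print("exact:", exact_hits)
print("fuzzy:", fuzzy_hits)
```

The exact lookup comes back empty while the fuzzy one still finds the intended word, which is the kind of tolerance a dedicated search engine adds on top of plain word matching.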
** UPDATE **
Using Lucene for your search, you will simply need to create a cron job (most web hosts offer this "feature") that executes a PHP script (e.g. "cd /path/to/script; php searchindexer.php") to update the indexes. The reason is that indexing thousands of "documents" (rows, data, etc.) may take several seconds, even minutes, and doing it ahead of time ensures that all searches are performed as fast as possible. Therefore, you may want to create a delayed job to be run by the server. It may run overnight, or within the next hour; this is up to you. The PHP script should look something like this:
    $indexer = Zend_Search_Lucene::create('/path/to/lucene/data');

    Zend_Search_Lucene_Analysis_Analyzer::setDefault(
        // change this option for your need
        new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num_CaseInsensitive()
    );

    $rowSet = getDataRowSet();  // perform your SQL query to fetch whatever you need to index
    foreach ($rowSet as $row) {
        $doc = new Zend_Search_Lucene_Document();
        $doc->addField(Zend_Search_Lucene_Field::text('field1', $row->field1, 'utf-8'))
            ->addField(Zend_Search_Lucene_Field::text('field2', $row->field2, 'utf-8'))
            ->addField(Zend_Search_Lucene_Field::unIndexed('someValue', $someVariable))
            ->addField(Zend_Search_Lucene_Field::unIndexed('someObj', serialize($obj), 'utf-8'))
        ;
        $indexer->addDocument($doc);
    }

    // ... you can get as many $rowSet as you want and create as many documents
    // as you wish... each document doesn't necessarily need the same fields...
    // Lucene is pretty flexible on this

    $indexer->optimize();  // do this every time you add more data to your indexer...
    $indexer->commit();    // finalize the process
Then, this is basically how you perform a basic search:
    $index = Zend_Search_Lucene::open('/path/to/lucene/data');

    // same search options
    Zend_Search_Lucene_Analysis_Analyzer::setDefault(
        new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num_CaseInsensitive()
    );

    Zend_Search_Lucene_Search_QueryParser::setDefaultEncoding('utf-8');

    $query = 'php +field1:foo';  // search for the word 'php' in any field,
                                 // +search for 'foo' in field 'field1'

    $hits = $index->find($query);

    $numHits = count($hits);
    foreach ($hits as $hit) {
        $score = $hit->score;    // the hit weight
        $field1 = $hit->field1;
        // etc.
    }
Here are great sites about Lucene in Java, PHP, and .Net.
In conclusion, each search method has its own pros and cons:
Please feel free to comment if I have forgotten/missed anything.
1. Similarity
For Levenshtein in MySQL I found this, from www.codejanitor.com/wp/2007/02/10/levenshtein-distance-as-a-mysql-stored-function
    SELECT column_name,
           LEVENSHTEIN(column_name, 'search_string') AS distance
      FROM table_name
     WHERE LEVENSHTEIN(column_name, 'search_string') < distance_limit
     ORDER BY distance ASC  -- smallest distance (most similar) first
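For readers who do not have such a stored function installed, here is a minimal Python sketch of the edit distance that a LEVENSHTEIN function is expected to compute. The function name `levenshtein` and the sample titles are made up for illustration; this is not the code from the linked page. It also shows the weakness mentioned earlier in the thread: a short keyword compared against a whole title gets a large distance even though the title contains it.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


# Hypothetical sample data, only for illustration.
titles = ["Stack Overflow", "Stock Exchange", "Overflow"]
query = "Stock Overflow"

# Most similar first means smallest distance first.
ranked = sorted(titles, key=lambda title: levenshtein(query, title))
print(ranked)

# Keyword against a full string: every extra character counts as an edit,
# even though the title contains the keyword verbatim.
keyword_distance = levenshtein("Overflow", "Stack Overflow")
print(keyword_distance)
```

Sorting in ascending order puts the closest string first, which is why the distance alone works for whole-string comparison but penalizes keyword searches.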
2. Containing, case insensitive
Use the LIKE operator of MySQL, which is case insensitive by default. The % is a wildcard, so there may be any string before and after search_string.
SELECT * FROM table WHERE column_name LIKE "%search_string%"
3. Containing, case sensitive
The MySQL Manual helps:
The default character set and collation are latin1 and latin1_swedish_ci, so nonbinary string comparisons are case insensitive by default. This means that if you search with col_name LIKE 'a%', you get all column values that start with A or a. To make this search case sensitive, make sure that one of the operands has a case sensitive or binary collation. For example, if you are comparing a column and a string that both have the latin1 character set, you can use the COLLATE operator to cause either operand to have the latin1_general_cs or latin1_bin collation...
My MySQL setup does not support latin1_general_cs or latin1_bin, but it worked fine for me to use the collation utf8_bin, as binary utf8 is case sensitive:
SELECT * FROM table WHERE column_name LIKE "%search_string%" COLLATE utf8_bin
2. / 3. sorted by Levenshtein Distance
    SELECT column_name,
           LEVENSHTEIN(column_name, 'search_string') AS distance  -- for sorting
      FROM table_name
     WHERE column_name LIKE "%search_string%"
           COLLATE utf8_bin  -- for case sensitivity, just leave out for CI
     ORDER BY distance ASC   -- smallest distance (most similar) first
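The same idea, filtering by containment first and then ordering by edit distance, can be sketched in Python over an in-memory list. This is only an illustration under assumptions: `search` and `levenshtein` are hypothetical helper names, the rows are invented, and a real application would let the database do the filtering.

```python
def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def search(rows, needle, case_sensitive=False):
    """Keep rows that contain `needle`, sorted so the smallest distance comes first."""
    def norm(text):
        # Lowercasing mimics a case-insensitive collation in this sketch.
        return text if case_sensitive else text.lower()

    hits = [row for row in rows if norm(needle) in norm(row)]
    return sorted(hits, key=lambda row: levenshtein(norm(needle), norm(row)))


# Hypothetical rows, only for illustration.
rows = ["Stack Overflow", "overflowing", "Overflow", "Buffer"]

insensitive = search(rows, "overflow")
sensitive = search(rows, "overflow", case_sensitive=True)
print(insensitive)
print(sensitive)
```

Because the containment filter runs first, the distance is only computed for rows that already match, which keeps the expensive part of the query small.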