Handling simple grammar in a PHP search engine

Tags: php, search, mysql

I am creating a simple search function for my website using MySQL and PHP. Right now, if I type the word "cat" into the search bar, I will NOT retrieve articles containing the word "cats", and vice versa. The same happens with the ending "ed".

The only way I can think of to solve this is to strip "s" and "ed" from the end of every word longer than a certain length (to avoid turning "Ted" into "T", etc.). However, this simple solution is nowhere near perfect. I'm hoping someone can suggest a better one.

Leo Jiang asked Dec 12 '22 00:12


1 Answer

The technique you are describing is called stemming. Because languages have been shaped by so many influences, it is difficult to handle well on your own at the application level. If you would rather not deal with it yourself, you can let MySQL do the heavy lifting, depending on which version of MySQL you are running:

  • 5.6.4 or later: it is built into the full-text search mechanism for both MyISAM and InnoDB tables.
  • 5.5 through 5.6.3: it is built in for MyISAM tables but not for InnoDB tables.
  • 5.1: there is a plugin available from mnoGoSearch.
  • Prior to 5.1: I think you need to handle it at the application level, but I have not confirmed that.
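Whatever your server does about stemming, boolean-mode full-text search also has a truncation operator (`*`) that matches any word starting with the given term, so `cat*` finds "cats". That is prefix matching rather than true stemming, but it covers the plural case from the question. Below is a minimal sketch of building such a search string in PHP. The function name `buildPrefixQuery`, the table `articles`, and the columns `title` and `body` are assumptions made for illustration, and the columns would need a FULLTEXT index.

```php
<?php
// Build a BOOLEAN MODE search string in which every term ends with the
// truncation operator (*), so "cat" also matches "cats" and "category".
function buildPrefixQuery(string $input, int $minLength = 3): string
{
    // Split on anything that is not a letter or digit. This also strips
    // boolean-mode operators such as + - * " ( ) from the user's input.
    $terms = preg_split('/[^a-z0-9]+/', strtolower($input), -1, PREG_SPLIT_NO_EMPTY);

    $parts = [];
    foreach ($terms as $term) {
        // Skip terms shorter than the assumed minimum word length.
        if (strlen($term) >= $minLength) {
            $parts[] = '+' . $term . '*';   // + marks the term as required
        }
    }
    return implode(' ', $parts);
}

// Hypothetical usage with PDO; `articles`, `title`, and `body` are assumed names.
// $stmt = $pdo->prepare(
//     'SELECT id, title FROM articles
//      WHERE MATCH(title, body) AGAINST(:q IN BOOLEAN MODE)'
// );
// $stmt->execute([':q' => buildPrefixQuery('cat food')]);  // binds "+cat* +food*"
```

Note that this only extends a term to the right: searching for "cats" will still not find "cat", which is one reason real stemming is the more complete answer.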

These links might help you get started.

  • http://dev.mysql.com/doc/refman/5.6/en/glossary.html#glos_stemming
  • http://dev.mysql.com/doc/refman/5.6/en/glossary.html#glos_full_text_search
  • http://dev.mysql.com/doc/refman/5.6/en/glossary.html#glos_fulltext_index
  • http://dev.mysql.com/doc/refman/5.6/en/fulltext-search.html

Be aware of the stopword list, which is a list of very common and often short words that are ignored in your search text when the query is processed. There are settings to control the stopword list if it is preventing you from getting the results you expect. You will likely also want to lower the minimum word length to 2 or 3 (the default is 4) and remove many of the words from the default list.
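To see how these two settings can make search terms vanish, here is a small PHP sketch that reports which words of a query would be ignored. The function name `ignoredTerms` is made up for this example, and the stopword array is a tiny illustrative sample, not MySQL's actual default list.

```php
<?php
// Report which search terms a full-text query would silently drop,
// given a minimum word length and a stopword list.
function ignoredTerms(string $input, int $minLength, array $stopwords): array
{
    $terms = preg_split('/[^a-z0-9]+/', strtolower($input), -1, PREG_SPLIT_NO_EMPTY);

    $ignored = [];
    foreach ($terms as $term) {
        if (strlen($term) < $minLength || in_array($term, $stopwords, true)) {
            $ignored[] = $term;
        }
    }
    return $ignored;
}

// Illustrative sample only; NOT MySQL's real default stopword list.
$sampleStopwords = ['the', 'and', 'for', 'with'];

// With a minimum length of 4, every word here is too short to be searched.
$droppedAtFour = ignoredTerms('The cat and Ted', 4, $sampleStopwords);

// With a minimum length of 3, only the stopwords are dropped.
$droppedAtThree = ignoredTerms('The cat and Ted', 3, $sampleStopwords);
```

A check like this could be used to warn the user that part of the query was ignored, instead of returning an unexplained empty result.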

If you do want to handle stemming on your own in PHP, there is a detailed technical discussion of the Porter Stemming Algorithm by Martin Porter, and there are at least two PHP implementations available: an older one for PHP 4 by Jon Abernathy, which may have some flaws, and a newer one for PHP 5 by Richard Heyes.
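For comparison, the suffix-stripping idea from the question can be sketched in a few lines of PHP. To be clear, this is not the Porter algorithm and not either of the implementations mentioned above; the function name `naiveStem` and the minimum stem length of 3 are assumptions chosen for illustration.

```php
<?php
// Naive suffix stripping: NOT the Porter algorithm, just the "s"/"ed" idea
// with a minimum stem length so that short words such as "Ted" are left alone.
function naiveStem(string $word, int $minStemLength = 3): string
{
    $word = strtolower($word);

    foreach (['ed', 's'] as $suffix) {
        $stemLength = strlen($word) - strlen($suffix);
        // Strip the suffix only if enough of the word remains afterwards.
        if ($stemLength >= $minStemLength && substr($word, -strlen($suffix)) === $suffix) {
            return substr($word, 0, $stemLength);
        }
    }
    return $word;
}

// naiveStem('cats')   returns 'cat'
// naiveStem('jumped') returns 'jump'
// naiveStem('Ted')    returns 'ted'  (too short to strip)
// naiveStem('hoped')  returns 'hop'  (wrong: collides with the verb "hop")
// naiveStem('glass')  returns 'glas' (wrong: the final "s" is not a plural)
```

The same function would have to be applied both to the text you index and to the search terms so that the two sides agree. The last two cases show why a proper stemmer such as Porter's, with its ordered rule sets, is worth using instead.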

I am assuming that you are primarily concerned with English, but I believe there is some support for other languages as well.

As mentioned by rnmccall, if you need more advanced search capabilities, you may need to go with Sphinx or Apache Lucene.

Night Owl answered Dec 25 '22 20:12