Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Search a text for multiple keywords

Tags:

php

mysql

I need to search a text (around 500 words long) for words in an English dictionary (around 275,000 keywords) to detect non-English words and right now the query I am using is not really optimized which takes more than 10 seconds to execute (there's a words table and a texts table):

SELECT word FROM words WHERE 'The quick brown fox jumps over the lazy dog' LIKE CONCAT( '%', word, '%' );

Got the idea from here.

I have already set the word field as an Index and seen some examples of people storing the text in the database or putting it directly in the query.

Other examples showed people using FULLTEXT search although having 300k words I don't think a FULLTEXT will work, I guess it's good to search with logic +brown +lazy -apple but in my case I don't need much logic.

Another example I've seen is to concate words with the IN (...) clause although having 500m keywords the query would just be insanely long.

Any ideas what to do?

Right now the text is saved as a text field and the words as varchar(50) in InnoDB with utf8_unicode_ci encoding, I've heard InnoDB is slow so I could use MyISAM or any other. I am using MySQL 5.5 although I could update to 5.6 if that helped.

like image 939
Kenzier Avatar asked Oct 31 '22 19:10

Kenzier


1 Answers

LIKE comparisons are basically just wildcard-capable equality tests. They are not a generic keyword search engine.

WHERE foo LIKE '%a b%' would find any records that contain the literal text a b anywhere in the foo field, they don't look for a or b separately, a b is a single monolithic "word" and that word is searched for in its entirety.

If you want to search for multiple "words" using LIKE, you have to do

WHERE foo LIKE '%a%' OR foo LIKE '%b%' OR etc...

which quickly gets ugly, and extremely inefficient - %... search cannot use indexes.

You'd be better off switching to a fulltext search system instead, where you can have the far simpler

WHERE MATCH(foo) AGAINST ('a b')
like image 77
Marc B Avatar answered Nov 02 '22 09:11

Marc B