Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fast(er) method for wildcard searching of 250K+ strings

I have an English dictionary in a MySQL database with just over 250K entries, and I'm using a simple ruby front-end to search it using wildcards at the beginning of the strings. So far I've been doing it like this:

SELECT * FROM words WHERE word LIKE '_e__o'

or even

SELECT * FROM words WHERE word LIKE '____s'

I always know the exact length of the word, but all but a single character are potentially unknown.

This is slower than molasses, about fifteen times slower than a similar query without the leading wildcard because the index for the column cannot be used.

I've tried a few methods to narrow the scope of the search. For example, I've added 26 additional columns containing each word's individual letter counts and narrow the search using those first. I've also tried narrowing by word length. These methods made almost no difference, thanks to the inherent inefficiency of leading-wildcard searches. I've experimented with the REGEXP statement, which is even slower.

SQLite and PostgreSQL are just as limited as MySQL, and though I have limited experience with NoSQL systems, my research gives me the impression that they excel at scalability, not performance of the kind I need.

My question then, is where should I look for a solution? Should I continue trying to find a way to optimize my queries or add supplementary columns that can narrow my potential recordset? Are there systems designed specifically to accomplish fast wildcard searching in this vein?

like image 793
Daniel Avatar asked Apr 11 '12 21:04

Daniel


1 Answers

With PostgreSQL 9.1 and the pg_trgm extension you can create indexes that are usable for a like condition you are describing.

For an example see here: http://www.depesz.com/2011/02/19/waiting-for-9-1-faster-likeilike/

I verified it on a table with 300k rows using LIKE '____1' and it does use such an index. It took about 120ms to count the number of rows in that table (on an old laptop). Interesting enough the expression LIKE 'd___1' is not faster, it's about the same speed.

It also depends on the number of characters in the search term, the longe it gets, the slower it will be as far as I can tell.

You would need to check with your data if the performance is acceptable.

like image 116
a_horse_with_no_name Avatar answered Oct 02 '22 12:10

a_horse_with_no_name