Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is the best way to implement a substring search in SQL?

We have a simple SQL problem here. In a varchar column, we wanted to search for a string anywhere in the field. What is the best way to implement this for performance? Obviously an index is not going to help here, any other tricks?

We are using MySQL and have about 3 million records. We need to execute many of these queries per second so really trying to implement these with the best performance.

The most simple way to do this is so far is:

Select * from table where column like '%search%'

I should further specify that the column is actually a long string like "sadfasdfwerwe" and I have to search for "asdf" in this column. So they are not sentences and trying to match a word in them. Would full text search still help here?

like image 615
erotsppa Avatar asked Jul 23 '10 17:07

erotsppa


1 Answers

Check out my presentation Practical Fulltext Search in MySQL.

I compared:

  • LIKE predicates
  • Regular expression predicates (no better than LIKE)
  • MyISAM FULLTEXT indexing
  • Sphinx Search
  • Apache Lucene
  • Inverted indexing
  • Google Custom Search Engine

Today what I would use is Apache Solr, which puts Lucene into a service with a bunch of extra features and tools.


Re your comment: Aha, okay, no. None of the fulltext search capabilities I mentioned are going to help, since they all assume some kind of word boundaries

The other way to efficiently find arbitrary substrings is the N-gram approach. Basically, create an index of all possible sequences of N letters and point to the strings where each respective sequence occurs. Typically this is done with N=3, or a trigram, because it's a point of compromise between matching longer substrings and keeping the index to a manageable size.

I don't know of any SQL database that supports N-gram indexing transparently, but you could set it up yourself using an inverted index:

create table trigrams (
  trigram char(3) primary key
);

create table trigram_matches (
  trigram char(3),
  document_id int,
  primary key (trigram, document_id),
  foreign key (trigram) references trigrams(trigram),
  foreign key (document_id) references mytable(document_id)
);

Now populate it the hard way:

insert into trigram_matches
  select t.trigram, d.document_id
  from trigrams t join mytable d
    on d.textcolumn like concat('%', t.trigram, '%');

Of course this will take quite a while! But once it's done, you can search much more quickly:

select d.*
from mytable d join trigram_matches t
  on t.document_id = d.document_id
where t.trigram = 'abc'

Of course you could be searching for patterns longer than three characters, but the inverted index still helps to narrow your search a lot:

select d.*
from mytable d join trigram_matches t
  on t.document_id = d.document_id
where t.trigram = 'abc'
  and d.textcolumn like '%abcdef%';
like image 76
Bill Karwin Avatar answered Oct 31 '22 16:10

Bill Karwin