We have a simple SQL problem here. In a varchar column, we wanted to search for a string anywhere in the field. What is the best way to implement this for performance? Obviously an index is not going to help here, any other tricks?
We are using MySQL and have about 3 million records. We need to execute many of these queries per second so really trying to implement these with the best performance.
The most simple way to do this is so far is:
Select * from table where column like '%search%'
I should further specify that the column is actually a long string like "sadfasdfwerwe" and I have to search for "asdf" in this column. So they are not sentences and trying to match a word in them. Would full text search still help here?
Check out my presentation Practical Fulltext Search in MySQL.
I compared:
LIKE
predicatesLIKE
)Today what I would use is Apache Solr, which puts Lucene into a service with a bunch of extra features and tools.
Re your comment: Aha, okay, no. None of the fulltext search capabilities I mentioned are going to help, since they all assume some kind of word boundaries
The other way to efficiently find arbitrary substrings is the N-gram approach. Basically, create an index of all possible sequences of N letters and point to the strings where each respective sequence occurs. Typically this is done with N=3, or a trigram, because it's a point of compromise between matching longer substrings and keeping the index to a manageable size.
I don't know of any SQL database that supports N-gram indexing transparently, but you could set it up yourself using an inverted index:
create table trigrams (
trigram char(3) primary key
);
create table trigram_matches (
trigram char(3),
document_id int,
primary key (trigram, document_id),
foreign key (trigram) references trigrams(trigram),
foreign key (document_id) references mytable(document_id)
);
Now populate it the hard way:
insert into trigram_matches
select t.trigram, d.document_id
from trigrams t join mytable d
on d.textcolumn like concat('%', t.trigram, '%');
Of course this will take quite a while! But once it's done, you can search much more quickly:
select d.*
from mytable d join trigram_matches t
on t.document_id = d.document_id
where t.trigram = 'abc'
Of course you could be searching for patterns longer than three characters, but the inverted index still helps to narrow your search a lot:
select d.*
from mytable d join trigram_matches t
on t.document_id = d.document_id
where t.trigram = 'abc'
and d.textcolumn like '%abcdef%';
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With