How can queries like
SELECT * FROM sometable WHERE somefield LIKE '%value%'
be optimized?
The main issue here is the leading wildcard, which prevents the DBMS from using an index.
Edit: What is more, the somefield value is a single solid string (not a piece of text), so full-text search cannot be used.
How long are your strings?
If they are relatively short (e.g. English words; avg_len=5) and you have database storage to spare, try this approach:
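For each word you want to store, store every suffix of that word instead, i.e. keep stripping the first character until nothing is left. For example, the word
value
gives:
value
alue
lue
ue
e
Store each of these suffixes in the database. You can now search for substrings using
LIKE 'alu%'
(which will find 'alu' as part of 'value').
By storing all suffixes, you have removed the need for the leading wildcard (allowing an index to be used for fast lookup), at the cost of storage space.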
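The suffix approach can be sketched as follows. This is only an illustration under assumptions: it uses Python's built-in sqlite3 module so that it runs without a server, and the table name word_suffix, its columns, and the helper functions are hypothetical names, not part of the question. In MySQL the lookup would be the same prefix query, WHERE suffix LIKE 'alu%', against an indexed suffix column.

```python
import sqlite3

def suffixes(word):
    """Return every non-empty suffix of word, longest first."""
    return [word[i:] for i in range(len(word))]

def build_suffix_table(conn, words):
    # Hypothetical schema: one row per suffix, pointing back at the full word.
    conn.execute("CREATE TABLE word_suffix (suffix TEXT NOT NULL, word TEXT NOT NULL)")
    conn.execute("CREATE INDEX idx_word_suffix ON word_suffix (suffix)")
    rows = [(s, w) for w in words for s in suffixes(w)]
    conn.executemany("INSERT INTO word_suffix (suffix, word) VALUES (?, ?)", rows)

def find_containing(conn, fragment):
    # A substring match on the word becomes a prefix match on one of its suffixes.
    # Assumption: fragment contains no LIKE wildcards (% or _).
    cur = conn.execute(
        "SELECT DISTINCT word FROM word_suffix WHERE suffix LIKE ? ORDER BY word",
        (fragment + "%",),
    )
    return [row[0] for row in cur]

conn = sqlite3.connect(":memory:")
build_suffix_table(conn, ["value", "evaluate", "table"])
print(suffixes("value"))             # ['value', 'alue', 'lue', 'ue', 'e']
print(find_containing(conn, "alu"))  # ['evaluate', 'value']
```

Keeping the full word next to each suffix is one possible design. A word id referencing a separate table would serve equally well and keeps the suffix rows narrower.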
Storage Cost
The number of characters required to store a word becomes roughly word_len*word_len / 2 (exactly word_len*(word_len+1)/2, the sum 1 + 2 + ... + word_len), i.e. quadratic in the word length, on a per-word basis. Using the rough estimate, here is the factor of increase for various word sizes:
(3*3/2) / 3 = 1.5
(5*5/2) / 5 = 2.5
(7*7/2) / 7 = 3.5
(12*12/2) / 12 = 6
The number of rows required to store a word increases from 1 to word_len. Be mindful of this overhead. Additional columns should be kept to a minimum to avoid storing large amounts of redundant data. For instance, a page number on which the word was originally found should be fine (think unsigned smallint), but extensive metadata on the word should be stored in a separate table on a per-word basis, rather than for each suffix.
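To make the storage estimate concrete, here is a small sketch that compares the rough word_len*word_len / 2 figure with the exact character count obtained by summing the suffix lengths. The function names are made up for this illustration.

```python
def exact_suffix_chars(word_len):
    """Exact characters stored: 1 + 2 + ... + word_len."""
    return word_len * (word_len + 1) // 2

def rough_suffix_chars(word_len):
    """The rough estimate used above: word_len * word_len / 2."""
    return word_len * word_len / 2

for n in (3, 5, 7, 12):
    exact = exact_suffix_chars(n)
    rough = rough_suffix_chars(n)
    # Rows needed equals n (one per suffix); factor is relative to storing the word once.
    print(f"len={n:2d} rows={n:2d} exact={exact:3d} (x{exact / n}) rough={rough} (x{rough / n})")
```

The exact factor is (word_len + 1) / 2, so the rough figure undercounts by half a word length per word. Either way, the growth is quadratic.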
Considerations
There is a trade-off in where we split 'words' (or fragments). As a real-world example: what do we do with hyphens? Do we store the adjective five-letter as one word or two?
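The trade-off is as follows:
Anything that is broken up cannot be found as a single element. If you store five and letter separately, searching for five-letter or fiveletter will fail.
Anything that is not broken up takes more storage space, since the storage requirement grows quadratically with the word length.
For convenience, you might want to remove the hyphen and store fiveletter. The word can now be found by searching five, letter, and fiveletter. (If you strip hyphens from any search query as well, users can still successfully find five-letter.)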
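The hyphen trade-off can be checked with a short in-memory sketch. It is not tied to any particular database, and the helper names strip_hyphens and matches are hypothetical; matches mimics a suffix LIKE 'query%' lookup using plain string operations.

```python
def suffixes(word):
    """Return every non-empty suffix of word."""
    return [word[i:] for i in range(len(word))]

def strip_hyphens(text):
    """Normalize a word or a search query by dropping hyphens."""
    return text.replace("-", "")

def matches(stored_words, query):
    """True if the normalized query is a prefix of some suffix of a stored word."""
    needle = strip_hyphens(query)
    return any(s.startswith(needle) for w in stored_words for s in suffixes(w))

joined = [strip_hyphens("five-letter")]  # stored as one word: 'fiveletter'
split = "five-letter".split("-")          # stored as two words: 'five', 'letter'

for query in ("five", "letter", "fiveletter", "five-letter"):
    print(query, matches(joined, query), matches(split, query))
```

With the joined form all four queries match. With the split form, fiveletter and five-letter do not, which is the failure case described above.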
Finally, there are ways of storing suffix arrays that do not incur much overhead, but I am not yet sure if they translate well to databases.