Optimization of MySQL search using "like" and wildcards

How can queries like

SELECT * FROM sometable WHERE somefield LIKE '%value%'

be optimized?

The main issue here is the leading wildcard, which prevents the DBMS from using an index.

Edit: What is more, the somefield value is a single solid string (not a piece of text), so full-text search cannot be used.

Jonas asked Jan 17 '10


1 Answer

How long are your strings?

If they are relatively short (e.g. English words; avg_len=5) and you have database storage to spare, try this approach:

  • For each word that you want to store in the table, store every possible suffix of that word instead. In other words, keep stripping the first character until nothing is left. For example, the word value gives:
    • value
    • alue
    • lue
    • ue
    • e
  • Store each of these suffixes in the database.
  • You can now search for substrings using LIKE 'alu%' (which will find 'alu' as part of 'value').

By storing all suffixes, you have removed the need for the leading wildcard (allowing an index to be used for fast lookup), at the cost of storage space.
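
A minimal sketch of what this could look like in MySQL. The word_suffixes table and its column names are placeholders (not anything from the original answer), and the suffixes are assumed to be generated by the application before insertion:

  CREATE TABLE word_suffixes (
      suffix  VARCHAR(64)  NOT NULL,   -- one suffix of the original word
      word_id INT UNSIGNED NOT NULL,   -- reference back to the original word
      INDEX idx_suffix (suffix)        -- usable because the pattern has no leading wildcard
  );

  -- Suffixes of 'value', generated by the application layer:
  INSERT INTO word_suffixes (suffix, word_id) VALUES
      ('value', 1), ('alue', 1), ('lue', 1), ('ue', 1), ('e', 1);

  -- A substring search for 'alu' becomes a prefix search over suffixes,
  -- so the index on suffix can be used:
  SELECT DISTINCT word_id FROM word_suffixes WHERE suffix LIKE 'alu%';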

Storage Cost

The number of characters required to store a word becomes word_len*(word_len+1)/2, i.e. quadratic in the word length, on a per-word basis. Here is the factor of increase for various word sizes:

  • 3-letter word: (3*4/2) / 3 = 2
  • 5-letter word: (5*6/2) / 5 = 3
  • 7-letter word: (7*8/2) / 7 = 4
  • 12-letter word: (12*13/2) / 12 = 6.5

The number of rows required to store a word increases from 1 to word_len. Be mindful of this overhead. Additional columns should be kept to a minimum to avoid storing large amounts of redundant data. For instance, a page number on which the word was originally found should be fine (think unsigned smallint), but extensive metadata on the word should be stored in a separate table on a per-word basis, rather than for each suffix.
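
Extending the earlier sketch, the table split might look like this (again, all table and column names are placeholders): the heavier per-word metadata is stored once in a words table, while each suffix row carries only a small key and, say, a page number.

  CREATE TABLE words (
      word_id    INT UNSIGNED NOT NULL PRIMARY KEY,
      word       VARCHAR(64)  NOT NULL,
      definition TEXT                        -- heavier metadata, stored once per word
  );

  CREATE TABLE word_suffixes (
      suffix  VARCHAR(64)       NOT NULL,
      word_id INT UNSIGNED      NOT NULL,    -- points at the words row
      page    SMALLINT UNSIGNED NOT NULL,    -- a small extra column per suffix is fine
      INDEX idx_suffix (suffix)
  );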

Considerations

There is a trade-off in where we split 'words' (or fragments). As a real-world example: what do we do with hyphens? Do we store the adjective five-letter as one word or two?

The trade-off is as follows:

  • Anything that is broken up cannot be found as a single element. If we store five and letter separately, searching for five-letter or fiveletter will fail.
  • Anything that is not broken up will take more storage space. Remember, the storage requirement increases quadratically in the word length.

For convenience, you might want to remove the hyphen and store fiveletter. The word can now be found by searching for five, letter, or fiveletter. (If you strip hyphens from any search query as well, users can still successfully find five-letter.)
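
One possible way to apply the same normalization at query time, assuming the hypothetical word_suffixes table from the sketches above, is to strip hyphens from the search term before the prefix match:

  -- Strip hyphens from the user's term so 'five-letter' matches the stored 'fiveletter':
  SELECT DISTINCT word_id
  FROM word_suffixes
  WHERE suffix LIKE CONCAT(REPLACE('five-letter', '-', ''), '%');

In practice the stripping would more likely be done in application code before the query is built, which keeps the pattern a simple string literal.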

Finally, there are ways of storing suffix arrays that do not incur much overhead, but I am not yet sure if they translate well to databases.

Timo answered Sep 21 '22