Optimization of MySQL search using "like" and wildcards

How can queries like

SELECT * FROM sometable WHERE somefield LIKE '%value%'

be optimized?

The main issue here is the leading wildcard, which prevents the DBMS from using an index.

Edit: What is more, the somefield value is a single solid string (not a piece of text), so full-text search cannot be used.

Jonas asked Jan 17 '10


1 Answer

How long are your strings?

If they are relatively short (e.g. English words; avg_len=5) and you have database storage to spare, try this approach:

  • For each word that you want to store in the table, store every possible suffix of that word instead. In other words, keep stripping the first character until nothing is left. For example, the word value gives:
    • value
    • alue
    • lue
    • ue
    • e
  • Store each of these suffixes in the database.
  • You can now search for substrings using LIKE 'alu%' (which will find 'alu' as part of 'value').

By storing all suffixes, you have removed the need for the leading wildcard (allowing an index to be used for fast lookup), at the cost of storage space.
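
A minimal sketch of what this could look like in MySQL. The word_suffixes table and its column names are placeholders (not anything from the original answer), and the suffixes are assumed to be generated by the application before insertion:

  CREATE TABLE word_suffixes (
      suffix  VARCHAR(64)  NOT NULL,   -- one suffix of the original word
      word_id INT UNSIGNED NOT NULL,   -- reference back to the original word
      INDEX idx_suffix (suffix)        -- usable because the pattern has no leading wildcard
  );

  -- Suffixes of 'value', generated by the application layer:
  INSERT INTO word_suffixes (suffix, word_id) VALUES
      ('value', 1), ('alue', 1), ('lue', 1), ('ue', 1), ('e', 1);

  -- A substring search for 'alu' becomes a prefix search over suffixes,
  -- so the index on suffix can be used:
  SELECT DISTINCT word_id FROM word_suffixes WHERE suffix LIKE 'alu%';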

Storage Cost

The number of characters required to store a word becomes word_len*(word_len+1)/2, i.e. quadratic in the word length, on a per-word basis. Here is the factor of increase for various word sizes:

  • 3-letter word: (3*4/2) / 3 = 2
  • 5-letter word: (5*6/2) / 5 = 3
  • 7-letter word: (7*8/2) / 7 = 4
  • 12-letter word: (12*13/2) / 12 = 6.5

The number of rows required to store a word increases from 1 to word_len. Be mindful of this overhead. Additional columns should be kept to a minimum to avoid storing large amounts of redundant data. For instance, a page number on which the word was originally found should be fine (think unsigned smallint), but extensive metadata on the word should be stored in a separate table on a per-word basis, rather than for each suffix.
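
Extending the earlier sketch, the table split might look like this (again, all table and column names are placeholders): the heavier per-word metadata is stored once in a words table, while each suffix row carries only a small key and, say, a page number.

  CREATE TABLE words (
      word_id    INT UNSIGNED NOT NULL PRIMARY KEY,
      word       VARCHAR(64)  NOT NULL,
      definition TEXT                        -- heavier metadata, stored once per word
  );

  CREATE TABLE word_suffixes (
      suffix  VARCHAR(64)       NOT NULL,
      word_id INT UNSIGNED      NOT NULL,    -- points at the words row
      page    SMALLINT UNSIGNED NOT NULL,    -- a small extra column per suffix is fine
      INDEX idx_suffix (suffix)
  );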

Considerations

There is a trade-off in where we split 'words' (or fragments). As a real-world example: what do we do with hyphens? Do we store the adjective five-letter as one word or two?

The trade-off is as follows:

  • Anything that is broken up cannot be found as a single element. If we store five and letter separately, searching for five-letter or fiveletter will fail.
  • Anything that is not broken up will take more storage space. Remember, the storage requirement increases quadratically in the word length.

For convenience, you might want to remove the hyphen and store fiveletter. The word can now be found by searching for five, letter, or fiveletter. (If you strip hyphens from any search query as well, users can still successfully find five-letter.)
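
One possible way to apply the same normalization at query time, assuming the hypothetical word_suffixes table from the sketches above, is to strip hyphens from the search term before the prefix match:

  -- Strip hyphens from the user's term so 'five-letter' matches the stored 'fiveletter':
  SELECT DISTINCT word_id
  FROM word_suffixes
  WHERE suffix LIKE CONCAT(REPLACE('five-letter', '-', ''), '%');

In practice the stripping would more likely be done in application code before the query is built, which keeps the pattern a simple string literal.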

Finally, there are ways of storing suffix arrays that do not incur much overhead, but I am not yet sure if they translate well to databases.

Timo answered Sep 21 '22