Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

SQLite: Efficient substring search in large table

I'm developing an Android application that has to perform substring search in a large table (about 500'000 entries with street and location names, so just a few words per entry).

CREATE TABLE Elements (elementID INTEGER, type INTEGER, name TEXT, data BLOB)

Note that only 20% of all entries contain strings in the "name" column.

Performing the following query almost takes 2 minutes:

SELECT elementID, name FROM Elements WHERE name LIKE %foo%

I now tried to use FTS3 in order to speed up the query. That was quite successful, query time decreased to 1 minute (surprisingly the database file size increased by only 5%, which is also quite good for my purpose).

The problem is, FTS3 seemingly doesn't support substring search, i.e. if I want to find "bar" in "foo bar" and "foobar", I only get "foo bar", although I need both results.

So actually I have two questions:

  1. Is it possible to further speed up the query? My goal is 30 seconds for the query, but I don't know if that's realistic...

  2. How can I get real substring search using FTS3?

like image 493
Aletheios Avatar asked Jul 04 '12 19:07

Aletheios


People also ask

What is FTS SQLite?

FTS5 is an SQLite virtual table module that provides full-text search functionality to database applications.

How many rows SQLite can handle?

The theoretical maximum number of rows in a table is 264 (18446744073709551616 or about 1.8e+19). This limit is unreachable since the maximum database size of 281 terabytes will be reached first.

Is SQLite slower?

The SQLite3 database will just return the single record that you wanted directly. So your "sequential file" is 50,000 times slower than SQLite3 at retrieving data.


1 Answers

Solution 1: If you can make every character in your database as an individual word, you can use phrase queries to search the substring.

For example, assume "my_table" contains a single column "person":

person
------
John Doe
Jane Doe

you can change it to

person
------
J o h n D o e
J a n e D o e

To search the substring "ohn", use phrase query:

SELECT * FROM my_table WHERE person MATCH '"o h n"'

Beware that "JohnD" will match "John Doe", which may not be desired. To fix it, change the space character in the original string into something else.

For example, you can replace the space character with "$":

person
------
J o h n $ D o e
J a n e $ D o e

Solution 2: Following the idea of solution 1, you can make every character as an individual word with a custom tokenizer and use phrase queries to query substrings.

The advantage over solution 1 is that you don't have to add spaces in your data, which can unnecessarily increase the size of database.

The disadvantage is that you have to implement the custom tokenizer. Fortunately, I have one ready for you. The code is in C, so you have to figure out how to integrate it with your Java code.

like image 164
Hai Feng Kao Avatar answered Sep 30 '22 15:09

Hai Feng Kao