Background I'm implementing full-text search over a body of email messages stored in SQLite, making use of its fantastic built-in FTS4 engine. I'm getting some rather poor query performance, although not exactly where I would expect. Let's take a look. Representative schema I'll give some simplified examples of the code in question, with links to the full code where applicable. We've got a <code>MessageTable</code> that stores the data about an email message (full version spread out over several files here, here, and here): <pre class="prettyprint"><code>CREATE TABLE MessageTable ( id INTEGER PRIMARY KEY, internaldate_time_t INTEGER ); CREATE INDEX MessageTableInternalDateTimeTIndex ON MessageTable(internaldate_time_t); </code></pre> The searchable text is added to an FTS4 table named <code>MessageSearchTable</code> (full version here): <pre class="prettyprint"><code>CREATE VIRTUAL TABLE MessageSearchTable USING fts4( id INTEGER PRIMARY KEY, body ); </code></pre> The <code>id</code> in the search table acts as a foreign key to the message table. I'll leave it as an exercise for the reader to insert data into these tables (I certainly can't give out my private email). I have just under 26k records in each table. Problem query When we retrieve search results, we need them to be ordered descending by <code>internaldate_time_t</code> so we can pluck out only the most recent few results. Here's an example search query (full version here): <pre class="prettyprint"><code>SELECT id FROM MessageSearchTable JOIN MessageTable USING (id) WHERE MessageSearchTable MATCH 'a' ORDER BY internaldate_time_t DESC LIMIT 10 OFFSET 0 </code></pre> On my machine, with my email, that runs in about 150 milliseconds, as measured via: <pre class="prettyprint"><code>time sqlite3 test.db <<<"..." > /dev/null </code></pre> 150 milliseconds is no beast of a query, but for a simple FTS lookup and indexed order, it's sluggish. If I omit the <code>ORDER BY</code>, it completes in 10 milliseconds, for example. Also keep in mind that the actual query has one more sub-select, so there's a little more work going on in general: the full version of the query runs in about 600 milliseconds, which is into beast territory, and omitting the <code>ORDER BY</code> in that case shaves 500 milliseconds off the time. If I turn on stats inside <code>sqlite3</code> and run the query, I notice the line: <pre class="prettyprint"><code>Sort Operations: 1 </code></pre> If my interpretation of the docs about those stats is correct, it looks like the query is completely skipping using the <code>MessageTableInternalDateTimeTIndex</code>. The full version of the query also has the line: <pre class="prettyprint"><code>Fullscan Steps: 25824 </code></pre> Sounds like it's walking the table somewhere, but let's ignore that for now. What I've discovered So let's work on optimizing that a little bit. I can rearrange the query into a sub-select and force SQLite to use our index with the <code>INDEXED BY</code> extension: <pre class="prettyprint"><code>SELECT id FROM MessageTable INDEXED BY MessageTableInternalDateTimeTIndex WHERE id IN ( SELECT id FROM MessageSearchTable WHERE MessageSearchTable MATCH 'a' ) ORDER BY internaldate_time_t DESC LIMIT 10 OFFSET 0 </code></pre> Lo and behold, the running time has dropped to around 100 milliseconds (300 milliseconds in the full version of the query, a 50% reduction in running time), and there are no sort operations reported. Note that with just reorganizing the query like this but not forcing the index with <code>INDEXED BY</code>, there's still a sort operation (though we've still shaved off a few milliseconds oddly enough), so it appears that SQLite is indeed ignoring our index unless we force it. I've also tried some other things to see if they'd make a difference, but they didn't: <ul> <li>Explicitly making the index <code>DESC</code> as described here, with and without <code>INDEXED BY</code> </li> <li>Explicitly adding the <code>id</code> column in the index, with and without <code>internaldate_time_t</code> ordered <code>DESC</code>, with and without <code>INDEXED BY</code> </li> <li>Probably several other things I can't remember at this moment</li> </ul> Questions 100 milliseconds here still seems awfully slow for what seems like it should be a simple FTS lookup and indexed order. <ul> <li>What's going on here? Why is it ignoring the obvious index unless you force its hand?</li> <li>Am I hitting some limitation with combining data from virtual and regular tables?</li> <li>Why is it still so relatively slow, and is there anything else I can do to get FTS matches ordered by a field in another table?</li> </ul> Thanks!

An index is useful for looking up a table row based on the value of the indexed column. Once a table row is found, indexes are no longer useful because it is not efficient to look up a table row in an index by any other criterium. An implication of this is that it is not possible to use more than one index for each table accessed in a query. Also see the documentation: Query Planning, Query Optimizer. <hr> Your first query has the following EXPLAIN QUERY PLAN output: <pre class="prettyprint"><code>0 0 0 SCAN TABLE MessageSearchTable VIRTUAL TABLE INDEX 4: (~0 rows) 0 1 1 SEARCH TABLE MessageTable USING INTEGER PRIMARY KEY (rowid=?) (~1 rows) 0 0 0 USE TEMP B-TREE FOR ORDER BY </code></pre> What happens is that <ol> <li>the FTS index is used to find all matching <code>MessageSearchTable</code> rows;</li> <li>for each row found in 1., the <code>MessageTable</code> primary key index is used to find the matching row;</li> <li>all rows found in 2. are sorted with a temporary table;</li> <li>the first 10 rows are returned.</li> </ol> Your second query has the following EXPLAIN QUERY PLAN output: <pre class="prettyprint"><code>0 0 0 SCAN TABLE MessageTable USING COVERING INDEX MessageTableInternalDateTimeTIndex (~100000 rows) 0 0 0 EXECUTE LIST SUBQUERY 1 1 0 0 SCAN TABLE MessageSearchTable VIRTUAL TABLE INDEX 4: (~0 rows) </code></pre> What happens is that <ol> <li>the FTS index is used to find all matching <code>MessageSearchTable</code> rows;</li> <li>SQLite goes through all entries in the <code>MessageTableInternalDateTimeTIndex</code> in the index order, and returns a row when the <code>id</code> value is one of the values found in step 1. SQLite stops after the tenth such row.</li> </ol> In this query, it is possible to use the index for (implied) sorting, but only because no other index is used for looking up rows in this table. Using an index in this way implies that SQLite has to go through all entries, instead of lookup up the few rows that match some other condition. When you omit the <code>INDEXED BY</code> clause from your second query, you get the following EXPLAIN QUERY PLAN output: <pre class="prettyprint"><code>0 0 0 SEARCH TABLE MessageTable USING INTEGER PRIMARY KEY (rowid=?) (~25 rows) 0 0 0 EXECUTE LIST SUBQUERY 1 1 0 0 SCAN TABLE MessageSearchTable VIRTUAL TABLE INDEX 4: (~0 rows) 0 0 0 USE TEMP B-TREE FOR ORDER BY </code></pre> which is essentially the same as your first query, except that joins and subqueries are handled slightly differently. <hr> With your table structure, it is not really possible to get faster. You are doing three operations: <ol> <li>looking up rows in <code>MessageSearchTable</code>;</li> <li>looking up corresponding rows in <code>MessageTable</code>;</li> <li>sorting rows by a <code>MessageTable</code> value.</li> </ol> As far as indexes are concerned, steps 2 and 3 conflict with each other. The database has to choose whether to use an index for step 2 (in which case sorting must be done explicitly) or for step 3 (in which case it has to go through all <code>MessageTable</code> entries). You could try to return fewer records from the FTS search by making the message time a part of the FTS table and searching only for the last few days (and increasing or dropping the time if you don't get enough results).

how can I get faster FTS4 query results ordered by a field in another table?

Tags:

sqlite

fts3

fts4

Background

I'm implementing full-text search over a body of email messages stored in SQLite, making use of its fantastic built-in FTS4 engine. I'm getting some rather poor query performance, although not exactly where I would expect. Let's take a look.

Representative schema

I'll give some simplified examples of the code in question, with links to the full code where applicable.

We've got a MessageTable that stores the data about an email message (full version spread out over several files here, here, and here):

CREATE TABLE MessageTable (
    id INTEGER PRIMARY KEY,
    internaldate_time_t INTEGER
);
CREATE INDEX MessageTableInternalDateTimeTIndex
    ON MessageTable(internaldate_time_t);

The searchable text is added to an FTS4 table named MessageSearchTable (full version here):

CREATE VIRTUAL TABLE MessageSearchTable USING fts4(
    id INTEGER PRIMARY KEY,
    body
);

The id in the search table acts as a foreign key to the message table.

I'll leave it as an exercise for the reader to insert data into these tables (I certainly can't give out my private email). I have just under 26k records in each table.

Problem query

When we retrieve search results, we need them to be ordered descending by internaldate_time_t so we can pluck out only the most recent few results. Here's an example search query (full version here):

SELECT id
FROM MessageSearchTable
JOIN MessageTable USING (id)
WHERE MessageSearchTable MATCH 'a'
ORDER BY internaldate_time_t DESC
LIMIT 10 OFFSET 0

On my machine, with my email, that runs in about 150 milliseconds, as measured via:

time sqlite3 test.db <<<"..." > /dev/null

150 milliseconds is no beast of a query, but for a simple FTS lookup and indexed order, it's sluggish. If I omit the ORDER BY, it completes in 10 milliseconds, for example. Also keep in mind that the actual query has one more sub-select, so there's a little more work going on in general: the full version of the query runs in about 600 milliseconds, which is into beast territory, and omitting the ORDER BY in that case shaves 500 milliseconds off the time.

If I turn on stats inside sqlite3 and run the query, I notice the line:

Sort Operations:                     1

If my interpretation of the docs about those stats is correct, it looks like the query is completely skipping using the MessageTableInternalDateTimeTIndex. The full version of the query also has the line:

Fullscan Steps:                      25824

Sounds like it's walking the table somewhere, but let's ignore that for now.

What I've discovered

So let's work on optimizing that a little bit. I can rearrange the query into a sub-select and force SQLite to use our index with the INDEXED BY extension:

SELECT id
FROM MessageTable
INDEXED BY MessageTableInternalDateTimeTIndex
WHERE id IN (
    SELECT id
    FROM MessageSearchTable
    WHERE MessageSearchTable MATCH 'a'
)
ORDER BY internaldate_time_t DESC
LIMIT 10 OFFSET 0

Lo and behold, the running time has dropped to around 100 milliseconds (300 milliseconds in the full version of the query, a 50% reduction in running time), and there are no sort operations reported. Note that with just reorganizing the query like this but not forcing the index with INDEXED BY, there's still a sort operation (though we've still shaved off a few milliseconds oddly enough), so it appears that SQLite is indeed ignoring our index unless we force it.

I've also tried some other things to see if they'd make a difference, but they didn't:

Explicitly making the index DESC as described here, with and without INDEXED BY
Explicitly adding the id column in the index, with and without internaldate_time_t ordered DESC, with and without INDEXED BY
Probably several other things I can't remember at this moment

Questions

100 milliseconds here still seems awfully slow for what seems like it should be a simple FTS lookup and indexed order.

What's going on here? Why is it ignoring the obvious index unless you force its hand?
Am I hitting some limitation with combining data from virtual and regular tables?
Why is it still so relatively slow, and is there anything else I can do to get FTS matches ordered by a field in another table?

Thanks!

684

asked Aug 24 '13 00:08

chazomaticus

1 Answers

An index is useful for looking up a table row based on the value of the indexed column. Once a table row is found, indexes are no longer useful because it is not efficient to look up a table row in an index by any other criterium.

An implication of this is that it is not possible to use more than one index for each table accessed in a query.

Also see the documentation: Query Planning, Query Optimizer.

Your first query has the following EXPLAIN QUERY PLAN output:

0 0 0 SCAN TABLE MessageSearchTable VIRTUAL TABLE INDEX 4: (~0 rows)
0 1 1 SEARCH TABLE MessageTable USING INTEGER PRIMARY KEY (rowid=?) (~1 rows)
0 0 0 USE TEMP B-TREE FOR ORDER BY

What happens is that

the FTS index is used to find all matching MessageSearchTable rows;
for each row found in 1., the MessageTable primary key index is used to find the matching row;
all rows found in 2. are sorted with a temporary table;
the first 10 rows are returned.

Your second query has the following EXPLAIN QUERY PLAN output:

0 0 0 SCAN TABLE MessageTable USING COVERING INDEX MessageTableInternalDateTimeTIndex (~100000 rows)
0 0 0 EXECUTE LIST SUBQUERY 1
1 0 0 SCAN TABLE MessageSearchTable VIRTUAL TABLE INDEX 4: (~0 rows)

What happens is that

the FTS index is used to find all matching MessageSearchTable rows;
SQLite goes through all entries in the MessageTableInternalDateTimeTIndex in the index order, and returns a row when the id value is one of the values found in step 1. SQLite stops after the tenth such row.

In this query, it is possible to use the index for (implied) sorting, but only because no other index is used for looking up rows in this table. Using an index in this way implies that SQLite has to go through all entries, instead of lookup up the few rows that match some other condition.

When you omit the INDEXED BY clause from your second query, you get the following EXPLAIN QUERY PLAN output:

0 0 0 SEARCH TABLE MessageTable USING INTEGER PRIMARY KEY (rowid=?) (~25 rows)
0 0 0 EXECUTE LIST SUBQUERY 1
1 0 0 SCAN TABLE MessageSearchTable VIRTUAL TABLE INDEX 4: (~0 rows)
0 0 0 USE TEMP B-TREE FOR ORDER BY

which is essentially the same as your first query, except that joins and subqueries are handled slightly differently.

With your table structure, it is not really possible to get faster. You are doing three operations:

looking up rows in MessageSearchTable;
looking up corresponding rows in MessageTable;
sorting rows by a MessageTable value.

As far as indexes are concerned, steps 2 and 3 conflict with each other. The database has to choose whether to use an index for step 2 (in which case sorting must be done explicitly) or for step 3 (in which case it has to go through all MessageTable entries).

You could try to return fewer records from the FTS search by making the message time a part of the FTS table and searching only for the last few days (and increasing or dropping the time if you don't get enough results).

answered Sep 19 '22 21:09

CL.

Related questions
                            
                                Android SQLite using db.query() for JOIN instead of rawquery()
                            
                                Django: using more than one database with inspectdb?
                            
                                How to sort text in sqlite3 with specified locale?
                            
                                Why can't you use SQLite ROWID as a Primary key?
                            
                                Unable to load DLL sqlite3 in Universal Windows App running on Mobile
                            
                                How do I get a list of indexed Columns for a given Table
                            
                                Java File Path Windows/Linux [closed]
                            
                                Proper use of callback function of sqlite3 in C++
                            
                                SQLite database update query with multiple where conditions in Android
                            
                                Android/Kotlin: Error: "Expecting a top level declaration > Task :app:buildInfoGeneratorDebug"
                            
                                Limit number of characters of a text field
                            
                                SQLAlchemy - "Dynamic Filter"
                            
                                Electron app cant find sqlite3 module
                            
                                In java, how can I delete a sqlite table?
                            
                                SQLite: autoincrement primary key questions
                            
                                DB with best inserts/sec performance? [closed]
                            
                                Deploying a Windows 8 Metro application that uses SQLite
                            
                                Using entity framework with both SQL Server and SQLite databases simultaneously
                            
                                Why Room entities don't work with immutable properties in Android
                            
                                Given the choice, what are the pros/cons of mixed-mode assemblies vs. separate interop DLLs?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With