I'm confused by the drastically different running times of the following two queries that produce identical output. The queries are running on Sqlite 3.7.9, on a table with about 4.5 million rows, and each produce ~50 rows of results.
Here are the queries:
% echo "SELECT DISTINCT acolumn FROM atable ORDER BY acolumn;" | time sqlite3 mydb
sqlite3 mydb 8.87s user 15.06s system 99% cpu 23.980 total
% echo "SELECT acolumn FROM (SELECT DISTINCT acolumn FROM atable) ORDER BY acolumn;" | time sqlite3 options
sqlite3 mydb 1.15s user 0.10s system 98% cpu 1.267 total
Shouldn't the performance of the two queries be closer? I understand that it may be the case that the query planner is performing the "sort" and "distinct" operations in different orders, but if so, does it need to? Or should it be able to figure out how to do it fastest?
Edit: as requested here is the output of the "EXPLAIN QUERY PLAN" command for each query.
For the first (monolithic) query:
0|0|0|SCAN TABLE atable (~1000000 rows)
0|0|0|USE TEMP B-TREE FOR DISTINCT
For the second (subquery) query:
1|0|0|SCAN TABLE atable (~1000000 rows)
1|0|0|USE TEMP B-TREE FOR DISTINCT
0|0|0|SCAN SUBQUERY 1 (~1000000 rows)
0|0|0|USE TEMP B-TREE FOR ORDER BY
Your first query orders the records first by inserting all of them into a sorted temporary table, and then implements the DISTINCT
by going through them and returning only those that are not identical to the previous one.
(This can be seen in the EXPLAIN
output shown below; the DISTINCT
actually got converted to a GROUP BY
, which behaves the same.)
Your second query is, in theory, identical to the first, but SQLite's query optimizer is rather simple and cannot prove that this conversion would be safe (as explained in the subquery flattening documentation).
Therefore, it is implemented by doing the DISTINCT
first, by inserting only any non-duplicates into a temporary table, and then doing the ORDER BY
with a second temporary table.
This second step is completely superfluous because the first temp table was already sorted, but this happens to be faster for your data anyway because you have so many duplicates that are never stored in either temp table.
In theory, your first query could be faster, because SQLite has already recognized that the DISTINCT
and ORDER BY
clauses can be implemented with the same sorted temporary table.
In practice, however, SQLite it is not smart enough to remember that the DISTINCT
implies that duplicates do not need to be stored in the temp table.
(This particular optimization might be added to SQLite if you ask nicely on the mailing list.)
$ sqlite3 mydb
sqlite> .explain
sqlite> explain SELECT DISTINCT acolumn FROM atable ORDER BY acolumn;
addr opcode p1 p2 p3 p4 p5 comment
---- ------------- ---- ---- ---- ------------- -- -------------
0 Trace 0 0 0 00
1 SorterOpen 1 2 0 keyinfo(1,BINARY) 00
2 Integer 0 3 0 00 clear abort flag
3 Integer 0 2 0 00 indicate accumulator empty
4 Null 0 6 6 00
5 Gosub 5 37 0 00
6 Goto 0 40 0 00
7 OpenRead 0 2 0 1 00 atable
8 Rewind 0 14 0 00
9 Column 0 0 8 00 atable.acolumn
10 Sequence 1 9 0 00
11 MakeRecord 8 2 10 00
12 SorterInsert 1 10 0 00
13 Next 0 9 0 01
14 Close 0 0 0 00
15 OpenPseudo 2 10 2 00
16 SorterSort 1 39 0 00 GROUP BY sort
17 SorterData 1 10 0 00
18 Column 2 0 7 20
19 Compare 6 7 1 keyinfo(1,BINARY) 00
20 Jump 21 25 21 00
21 Move 7 6 0 00
22 Gosub 4 32 0 00 output one row
23 IfPos 3 39 0 00 check abort flag
24 Gosub 5 37 0 00 reset accumulator
25 Column 2 0 1 00
26 Integer 1 2 0 00 indicate data in accumulator
27 SorterNext 1 17 0 00
28 Gosub 4 32 0 00 output final row
29 Goto 0 39 0 00
30 Integer 1 3 0 00 set abort flag
31 Return 4 0 0 00
32 IfPos 2 34 0 00 Groupby result generator entry point
33 Return 4 0 0 00
34 Copy 1 11 0 00
35 ResultRow 11 1 0 00
36 Return 4 0 0 00 end groupby result generator
37 Null 0 1 0 00
38 Return 5 0 0 00
39 Halt 0 0 0 00
40 Transaction 0 0 0 00
41 VerifyCookie 0 2 0 00
42 TableLock 0 2 0 atable 00
43 Goto 0 7 0 00
sqlite> explain SELECT acolumn FROM (SELECT DISTINCT acolumn FROM atable) ORDER BY acolumn;
addr opcode p1 p2 p3 p4 p5 comment
---- ------------- ---- ---- ---- ------------- -- -------------
0 Trace 0 0 0 00
1 Goto 0 39 0 00
2 Goto 0 17 0 00
3 OpenPseudo 0 3 1 01 coroutine for sqlite_subquery_DA7480_
4 Integer 0 2 0 01
5 OpenEphemeral 2 0 0 keyinfo(1,BINARY) 08
6 OpenRead 1 2 0 1 00 atable
7 Rewind 1 14 0 00
8 Column 1 0 3 00 atable.acolumn
9 Found 2 13 3 1 00
10 MakeRecord 3 1 4 00
11 IdxInsert 2 4 0 00
12 Yield 1 0 0 00
13 Next 1 8 0 01
14 Close 1 0 0 00
15 Integer 1 2 0 00
16 Yield 1 0 0 00 end sqlite_subquery_DA7480_
17 SorterOpen 3 3 0 keyinfo(1,BINARY) 00
18 Integer 2 1 0 00
19 Yield 1 0 0 00 next row of co-routine sqlite_subquery_DA7480_
20 If 2 29 0 00
21 Column 0 0 5 00 sqlite_subquery_DA7480_.acolumn
22 MakeRecord 5 1 6 00
23 Column 0 0 7 00 sqlite_subquery_DA7480_.acolumn
24 Sequence 3 8 0 00
25 Move 6 9 0 00
26 MakeRecord 7 3 10 00
27 SorterInsert 3 10 0 00
28 Goto 0 19 0 00
29 OpenPseudo 4 6 1 00
30 OpenPseudo 5 11 3 00
31 SorterSort 3 37 0 00
32 SorterData 3 11 0 00
33 Column 5 2 6 20
34 Column 4 0 5 20
35 ResultRow 5 1 0 00
36 SorterNext 3 32 0 00
37 Close 4 0 0 00
38 Halt 0 0 0 00
39 Transaction 0 0 0 00
40 VerifyCookie 0 2 0 00
41 TableLock 0 2 0 atable 00
42 Goto 0 2 0 00
Inside most DBMS, SQL statements are translated into relational algebra and then structured in an expression tree. The dbms then uses heuristics to optimise queries. One of the main heuristics is "Perform selection early" (p.46). I suppose the sqlite query planner does this as well, hence the differences in execution time.
Since the result of the subquery is much smaller (~50 rows opposed to 4.5 million), sorting, at the end of the expression tree, happens much faster. (Plain) Selecting isn't a very expensive process, running operations on a multitude of results is indeed.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With