SQLite3 query optimization join vs subselect

Tags: sql, database, sqlite, query-optimization

I am trying to figure out the best way (it probably doesn't matter in this case) to find the rows of one table based on a flag in that table and a relational id in a row of another table.

Here are the schemas:

    CREATE TABLE files (
        id INTEGER PRIMARY KEY,
        dirty INTEGER NOT NULL);

    CREATE TABLE resume_points (
        id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
        scan_file_id INTEGER NOT NULL);

I am using SQLite3.

The files table will be very large, typically 10K-5M rows. The resume_points table will be small (<10K rows) with only 1-2 distinct scan_file_id values.

My first thought was:

    select distinct files.* from resume_points inner join files
    on resume_points.scan_file_id=files.id where files.dirty = 1;

A coworker suggested turning the join around:

    select distinct files.* from files inner join resume_points
    on files.id=resume_points.scan_file_id where files.dirty = 1;

Then I thought that, since we know the number of distinct scan_file_id values will be so small, perhaps a subselect would be optimal (in this rare instance):

    select * from files where id in (select distinct scan_file_id from resume_points);

The explain outputs had the following numbers of rows: 42, 42, and 48 respectively.

asked Jun 28 '13 22:06

Grady Player

1 Answer

TL;DR: The best query and index are:

    create index uniqueFiles on resume_points (scan_file_id);
    select * from (select distinct scan_file_id from resume_points) d join files on d.scan_file_id = files.id and files.dirty = 1;

Since I typically work with SQL Server, at first I thought that surely the query optimizer would find the optimal execution plan for such a simple query regardless of which way you write these equivalent SQL statements. So I downloaded SQLite, and started playing around. Much to my surprise, there was a huge difference in performance.

Here's the setup code:

    CREATE TABLE files (
        id INTEGER PRIMARY KEY autoincrement,
        dirty INTEGER NOT NULL);

    CREATE TABLE resume_points (
        id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
        scan_file_id INTEGER NOT NULL);

    -- Seed one row, then double the table 23 times (2^23 = 8,388,608 rows).
    -- random() < 0 is true about half the time, so roughly half the files are dirty.
    insert into files (dirty) values (0);
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;

    -- Two batches of 5000 resume points each.
    insert into resume_points (scan_file_id) select (select abs(random() % 8000000)) from files limit 5000;
    insert into resume_points (scan_file_id) select (select abs(random() % 8000000)) from files limit 5000;
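For experimenting without building an 8-million-row table, the same setup can be scaled down. The following is a minimal sketch using Python's standard sqlite3 module with an in-memory database; the function name `build_test_db` and its parameters `doublings` and `points_per_batch` are illustrative assumptions, not part of the original test.

```python
import sqlite3


def build_test_db(doublings=10, points_per_batch=50):
    """Build a scaled-down copy of the test tables (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE files (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            dirty INTEGER NOT NULL);
        CREATE TABLE resume_points (
            id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
            scan_file_id INTEGER NOT NULL);
        INSERT INTO files (dirty) VALUES (0);
    """)
    # Each INSERT ... SELECT copies every existing row, doubling the table.
    for _ in range(doublings):
        conn.execute(
            "INSERT INTO files (dirty) "
            "SELECT (CASE WHEN random() < 0 THEN 1 ELSE 0 END) FROM files")
    n_files = conn.execute("SELECT count(*) FROM files").fetchone()[0]
    # Two batches of resume points, mirroring the two INSERTs in the setup.
    for _ in range(2):
        conn.execute(
            "INSERT INTO resume_points (scan_file_id) "
            "SELECT (SELECT abs(random() % ?)) FROM files LIMIT ?",
            (n_files, points_per_batch))
    conn.commit()
    return conn


if __name__ == "__main__":
    conn = build_test_db()
    print(conn.execute("SELECT count(*) FROM files").fetchone()[0])
    print(conn.execute(
        "SELECT count(DISTINCT scan_file_id) FROM resume_points").fetchone()[0])
```

Printing the distinct count is a quick way to check that the scaled-down data has the same shape as the real workload (many resume points, very few distinct scan_file_id values) before timing anything.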

I considered three indices:

    create index dirtyFiles on files (dirty, id);
    create index uniqueFiles on resume_points (scan_file_id);
    create index fileLookup on files (id);

Below are the queries I tried and the execution times on my i5 laptop. The database file size is only about 200MB since it doesn't have any other data.

    select distinct files.* from resume_points inner join files on resume_points.scan_file_id=files.id where files.dirty = 1;
    -- 4.3 - 4.5ms with and without index

    select distinct files.* from files inner join resume_points on files.id=resume_points.scan_file_id where files.dirty = 1;
    -- 4.4 - 4.7ms with and without index

    select * from (select distinct scan_file_id from resume_points) d join files on d.scan_file_id = files.id and files.dirty = 1;
    -- 2.0 - 2.5ms with uniqueFiles
    -- 2.6 - 2.9ms without uniqueFiles

    select * from files where id in (select distinct scan_file_id from resume_points) and dirty = 1;
    -- 2.1 - 2.5ms with uniqueFiles
    -- 2.6 - 3ms without uniqueFiles

    SELECT f.* FROM resume_points rp INNER JOIN files f on rp.scan_file_id = f.id
    WHERE f.dirty = 1 GROUP BY f.id;
    -- 4500 - 6190 ms with uniqueFiles
    -- 8.8 - 9.5 ms without uniqueFiles
    -- 14000 ms with uniqueFiles and fileLookup

    select * from files where exists (
    select * from resume_points where files.id = resume_points.scan_file_id) and dirty = 1;
    -- 8400 ms with uniqueFiles
    -- 7400 ms without uniqueFiles
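To reproduce this kind of comparison on your own data, something like the following harness may help. It is a rough sketch only: the names `make_db`, `time_query`, and `QUERIES` are hypothetical, the data is a small deterministic stand-in rather than the tables above, and the second query selects `files.*` instead of `*` so that both shapes return the same columns. Timings on a tiny in-memory database will not resemble the figures listed above.

```python
import sqlite3
import time

QUERIES = {
    "join + distinct": """
        SELECT DISTINCT files.* FROM resume_points
        INNER JOIN files ON resume_points.scan_file_id = files.id
        WHERE files.dirty = 1""",
    "reduce first": """
        SELECT files.* FROM (SELECT DISTINCT scan_file_id FROM resume_points) d
        JOIN files ON d.scan_file_id = files.id AND files.dirty = 1""",
}


def make_db():
    """Small stand-in data set: odd file ids are dirty, resume points hit ids 7 and 8."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE files (id INTEGER PRIMARY KEY, dirty INTEGER NOT NULL);
        CREATE TABLE resume_points (
            id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
            scan_file_id INTEGER NOT NULL);
    """)
    conn.executemany("INSERT INTO files (id, dirty) VALUES (?, ?)",
                     [(i, i % 2) for i in range(1, 20001)])
    conn.executemany("INSERT INTO resume_points (scan_file_id) VALUES (?)",
                     [(7 if i % 2 else 8,) for i in range(2000)])
    conn.commit()
    return conn


def time_query(conn, sql, repeats=5):
    """Run the query several times; return (best wall-clock seconds, rows)."""
    best = float("inf")
    rows = []
    for _ in range(repeats):
        start = time.perf_counter()
        rows = conn.execute(sql).fetchall()
        best = min(best, time.perf_counter() - start)
    return best, rows


if __name__ == "__main__":
    conn = make_db()
    for name, sql in QUERIES.items():
        seconds, rows = time_query(conn, sql)
        print(f"{name}: {seconds * 1000:.3f} ms, {len(rows)} row(s)")
```

Checking that every variant returns the same rows before comparing times guards against benchmarking queries that are not actually equivalent.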

It looks like SQLite's query optimizer isn't very advanced. The best queries first reduce resume_points to a small number of rows (two in the test case; the OP said it would be 1-2) and then look up each file to see whether it is dirty.

The dirtyFiles index didn't make much of a difference for any of the queries. I think this may be because of the way the data is arranged in the test tables; it may make a difference in production tables. However, the difference should not be large, since there will be fewer than a handful of lookups.

uniqueFiles does make a difference, since it can reduce the 10000 rows of resume_points to 2 rows without scanning through most of them.

fileLookup did make some queries slightly faster, but not enough to significantly change the results. Notably, it made the group by query very slow.

In conclusion, reduce the result set early to make the biggest difference.
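One way to see whether an index such as uniqueFiles is actually picked up is to ask SQLite for its plan with `EXPLAIN QUERY PLAN`. The sketch below assumes Python's standard sqlite3 module and empty in-memory tables; `query_plan`, `make_db`, and `REDUCE_FIRST` are hypothetical names, and the exact plan text depends on the SQLite version, so no particular output is claimed here.

```python
import sqlite3


def query_plan(conn, sql):
    """Return the detail text of each EXPLAIN QUERY PLAN row (last column)."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


def make_db(with_index):
    """Create the two tables, optionally with the uniqueFiles index."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE files (id INTEGER PRIMARY KEY, dirty INTEGER NOT NULL);
        CREATE TABLE resume_points (
            id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
            scan_file_id INTEGER NOT NULL);
    """)
    if with_index:
        conn.execute(
            "CREATE INDEX uniqueFiles ON resume_points (scan_file_id)")
    return conn


REDUCE_FIRST = """
    SELECT * FROM (SELECT DISTINCT scan_file_id FROM resume_points) d
    JOIN files ON d.scan_file_id = files.id AND files.dirty = 1"""

if __name__ == "__main__":
    for with_index in (False, True):
        print("with uniqueFiles" if with_index else "without uniqueFiles")
        for step in query_plan(make_db(with_index), REDUCE_FIRST):
            print("   ", step)
```

Note that plain `EXPLAIN` lists the bytecode instructions of the prepared statement, so a row count from it (such as the 42, 42, and 48 in the question) measures program length rather than expected run time; `EXPLAIN QUERY PLAN` is usually the more useful view when comparing query shapes.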

answered Nov 11 '22 04:11

John Tseng
