I am trying to figure out the very best way, (probably doesn't matter in this case) to find the rows of one table, based on the existence of a flag, and an relational id in a row in another table.
here are the schemas:
CREATE TABLE files (
id INTEGER PRIMARY KEY,
dirty INTEGER NOT NULL);
CREATE TABLE resume_points (
id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL ,
scan_file_id INTEGER NOT NULL );
I am using SQLite3
there files table will be very large, 10K-5M rows typically.
the resume_points will be small <10K with only 1-2 distinct scan_file_id
's
so my first thought was:
select distinct files.* from resume_points inner join files
on resume_points.scan_file_id=files.id where files.dirty = 1;
a coworker suggested turning the join around:
select distinct files.* from files inner join resume_points
on files.id=resume_points.scan_file_id where files.dirty = 1;
then I thought since we know that the number of distinct scan_file_id
's will be so small, perhaps a subselect would be optimal (in this rare instance):
select * from files where id in (select distinct scan_file_id from resume_points);
the explain
outputs had the following rows: 42, 42, and 48 respectively.
I won't leave you in suspense, between Joins and Subqueries, joins tend to execute faster. In fact, query retrieval time using joins will almost always outperform one that employs a subquery. The reason is that joins mitigate the processing burden on the database by replacing multiple queries with one join query.
The more data tables have, the subqueries are slower. The less data tables have, the subqueries have equivalent speed as joins. The subqueries are simpler, easier to understand, and easier to read.
If you need to combine related information from different rows within a table, then you can join the table with itself. Use subqueries when the result that you want requires more than one query and each subquery provides a subset of the table involved in the query.
“Is there a performance difference between putting the JOIN conditions in the ON clause or the WHERE clause in MySQL?” No, there's no difference. The following queries are algebraically equivalent inside MySQL and will have the same execution plan.
"Correlated subqueries" are faster than Normal joins.
TL;DR: The best query and index is:
create index uniqueFiles on resume_points (scan_file_id);
select * from (select distinct scan_file_id from resume_points) d join files on d.scan_file_id = files.id and files.dirty = 1;
Since I typically work with SQL Server, at first I thought that surely the query optimizer would find the optimal execution plan for such a simple query regardless of which way you write these equivalent SQL statements. So I downloaded SQLite, and started playing around. Much to my surprise, there was a huge difference in performance.
Here's the setup code:
CREATE TABLE files (
id INTEGER PRIMARY KEY autoincrement,
dirty INTEGER NOT NULL);
CREATE TABLE resume_points (
id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL ,
scan_file_id INTEGER NOT NULL );
insert into files (dirty) values (0);
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into resume_points (scan_file_id) select (select abs(random() % 8000000)) from files limit 5000;
insert into resume_points (scan_file_id) select (select abs(random() % 8000000)) from files limit 5000;
I considered two indices:
create index dirtyFiles on files (dirty, id);
create index uniqueFiles on resume_points (scan_file_id);
create index fileLookup on files (id);
Below are the queries I tried and the execution times on my i5 laptop. The database file size is only about 200MB since it doesn't have any other data.
select distinct files.* from resume_points inner join files on resume_points.scan_file_id=files.id where files.dirty = 1;
4.3 - 4.5ms with and without index
select distinct files.* from files inner join resume_points on files.id=resume_points.scan_file_id where files.dirty = 1;
4.4 - 4.7ms with and without index
select * from (select distinct scan_file_id from resume_points) d join files on d.scan_file_id = files.id and files.dirty = 1;
2.0 - 2.5ms with uniqueFiles
2.6-2.9ms without uniqueFiles
select * from files where id in (select distinct scan_file_id from resume_points) and dirty = 1;
2.1 - 2.5ms with uniqueFiles
2.6-3ms without uniqueFiles
SELECT f.* FROM resume_points rp INNER JOIN files f on rp.scan_file_id = f.id
WHERE f.dirty = 1 GROUP BY f.id
4500 - 6190 ms with uniqueFiles
8.8-9.5 ms without uniqueFiles
14000 ms with uniqueFiles and fileLookup
select * from files where exists (
select * from resume_points where files.id = resume_points.scan_file_id) and dirty = 1;
8400 ms with uniqueFiles
7400 ms without uniqueFiles
It looks like SQLite's query optimizer isn't very advanced at all. The best queries first reduce resume_points to a small number of rows (Two in the test case. The OP said it would be 1-2.), and then look up the file to see if it is dirty or not. dirtyFiles
index didn't make much of a difference for any of the files. I think it may be because of the way the data is arranged in the test tables. It may make a difference in production tables. However, the difference is not too great as there will be less than a handful of lookups. uniqueFiles
does make a difference since it can reduce 10000 rows of resume_points to 2 rows without scanning through most of them. fileLookup
did make some queries slightly faster, but not enough to significantly change the results. Notably it made group by very slow. In conclusion, reduce the result set early to make the biggest differences.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With