I am designing a mostly read-only database containing 300,000 documents with around 50,000 distinct tags, with each document having 15 tags on average. For now, the only query I care about is selecting all documents with no tag from a given set of tags. I'm only interested in the document_id
column (no other columns in the result).
My schema is essentially:
CREATE TABLE documents (
document_id SERIAL PRIMARY KEY,
title TEXT
);
CREATE TABLE tags (
tag_id SERIAL PRIMARY KEY,
name TEXT UNIQUE
);
CREATE TABLE documents_tags (
document_id INTEGER REFERENCES documents,
tag_id INTEGER REFERENCES tags,
PRIMARY KEY (document_id, tag_id)
);
I can write this query in Python by pre-computing the set of documents for a given tag, which reduces the problem to a few fast set operations:
In [17]: %timeit all_docs - (tags_to_docs[12345] | tags_to_docs[7654]) 100 loops, best of 3: 13.7 ms per loop
Translating the set operations to Postgres doesn't work that fast, however:
stuff=# SELECT document_id AS id FROM documents WHERE document_id NOT IN (
stuff(# SELECT documents_tags.document_id AS id FROM documents_tags
stuff(# WHERE documents_tags.tag_id IN (12345, 7654)
stuff(# );
document_id
---------------
...
Time: 201.476 ms
NOT IN
with EXCEPT
makes it even slower.document_id
and tag_id
in all three tables and another one on (document_id, tag_id)
.How do I speed up this query? Is there any way to pre-compute the mapping between like I did with Python, or am I thinking about this the wrong way?
Here is the result of an EXPLAIN ANALYZE
:
EXPLAIN ANALYZE
SELECT document_id AS id FROM documents
WHERE document_id NOT IN (
SELECT documents_tags.documents_id AS id FROM documents_tags
WHERE documents_tags.tag_id IN (12345, 7654)
);
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on documents (cost=20280.27..38267.57 rows=83212 width=4) (actual time=176.760..300.214 rows=20036 loops=1)
Filter: (NOT (hashed SubPlan 1))
Rows Removed by Filter: 146388
SubPlan 1
-> Bitmap Heap Scan on documents_tags (cost=5344.61..19661.00 rows=247711 width=4) (actual time=32.964..89.514 rows=235093 loops=1)
Recheck Cond: (tag_id = ANY ('{12345,7654}'::integer[]))
Heap Blocks: exact=3300
-> Bitmap Index Scan on documents_tags__tag_id_index (cost=0.00..5282.68 rows=247711 width=0) (actual time=32.320..32.320 rows=243230 loops=1)
Index Cond: (tag_id = ANY ('{12345,7654}'::integer[]))
Planning time: 0.117 ms
Execution time: 303.289 ms
(11 rows)
Time: 303.790 ms
The only settings I changed from the default configuration were:
shared_buffers = 5GB
temp_buffers = 128MB
work_mem = 512MB
effective_cache_size = 16GB
Running Postgres 9.4.5 on a server with 64GB RAM.
1:- Check Indexes. 2:- There should be indexes on all fields used in the WHERE and JOIN portions of the SQL statement 3:- Limit Size of Your Working Data Set. 4:- Only Select Fields You select as Need. 5:- Remove Unnecessary Table and index 6:- Remove OUTER JOINS.
Your memory settings seem reasonable for a 64GB server - except maybe work_mem = 512MB
. That's high. Your queries are not particularly complex and your tables are not that big.
4.5 million rows (300k x 15) in the simple junction table documents_tags
should occupy ~ 156 MB and the PK another 96 MB. For your query you typically don't need to read the whole table, just small parts of the index. For "mostly read-only" like you have, you should see index-only scans on the index of the PK exclusively. You don't need nearly as much work_mem
- which may not matter much - except where you have many concurrent queries. Quoting the manual:
... several running sessions could be doing such operations concurrently. Therefore, the total memory used could be many times the value of
work_mem
; it is necessary to keep this fact in mind when choosing the value.
Setting work_mem
too high may actually impair performance:
I suggest to reduce work_mem
to 128 MB or less to avoid possible memory starvation- unless you have other common queries that require more. You can always set it higher locally for special queries.
There are several other angles to optimize read performance:
All of this may help a little. But the key problem is this:
PRIMARY KEY (document_id, tag_id)
300k documents, 2 tags to exclude. Ideally, you have an index with tag_id
as leading column and document_id
as 2nd. With an index on just (tag_id)
you can't get index-only scans. If this query is your only use case, change your PK as demonstrated below.
Or probably even better: you can create an additional plain index on (tag_id, document_id)
if you need both - and drop the two other indexes on documents_tags
on just (tag_id)
and (document_id)
. They offer nothing over the two multicolumn indexes. The remaining 2 indexes (as opposed to 3 indexes before) are smaller and superior in every way. Rationale:
While being at it, I suggest to also CLUSTER
the table using the new PK, all in one transaction, possibly with some extra maintenance_work_mem
locally:
BEGIN;
SET LOCAL maintenance_work_mem = '256MB';
ALTER TABLE documents_tags
DROP CONSTRAINT documents_tags_pkey
, ADD PRIMARY KEY (tag_id, document_id); -- tag_id first.
CLUSTER documents_tags USING documents_tags_pkey;
COMMIT;
Don't forget to:
ANALYZE documents_tags;
The query itself is run-of-the-mill. Here are the 4 standard techniques:
NOT IN
is - to quote myself:
Only good for small sets without NULL values
Your use case exactly: all involved columns NOT NULL
and your list of excluded items is very short. Your original query is a hot contender.
NOT EXISTS
and LEFT JOIN / IS NULL
are always hot contenders. Both have been suggested in other answers. LEFT JOIN
has to be an actual LEFT [OUTER] JOIN
, though.
EXCEPT ALL
would be shortest, but often not as fast.
SELECT document_id
FROM documents d
WHERE document_id NOT IN (
SELECT document_id -- no need for column alias, only value is relevant
FROM documents_tags
WHERE tag_id IN (12345, 7654)
);
2. NOT EXISTS
SELECT document_id
FROM documents d
WHERE NOT EXISTS (
SELECT 1
FROM documents_tags
WHERE document_id = d.document_id
AND tag_id IN (12345, 7654)
);
3. LEFT JOIN / IS NULL
SELECT d.document_id
FROM documents d
LEFT JOIN documents_tags dt ON dt.document_id = d.document_id
AND dt.tag_id IN (12345, 7654)
WHERE dt.document_id IS NULL;
4. EXCEPT ALL
SELECT document_id
FROM documents
EXCEPT ALL -- ALL, to keep duplicate rows and make it faster
SELECT document_id
FROM documents_tags
WHERE tag_id IN (12345, 7654);
I ran a quick benchmark on my old laptop with 4 GB RAM and Postgres 9.5.3 to put my theories to the test:
SET random_page_cost = 1.1;
SET work_mem = '128MB';
CREATE SCHEMA temp;
SET search_path = temp, public;
CREATE TABLE documents (
document_id serial PRIMARY KEY,
title text
);
-- CREATE TABLE tags ( ... -- actually irrelevant for this query
CREATE TABLE documents_tags (
document_id integer REFERENCES documents,
tag_id integer -- REFERENCES tags -- irrelevant for test
-- no PK yet, to test seq scan
-- it's also faster to create the PK after filling the big table
);
INSERT INTO documents (title)
SELECT 'some dummy title ' || g
FROM generate_series(1, 300000) g;
INSERT INTO documents_tags(document_id, tag_id)
SELECT i.*
FROM documents d
CROSS JOIN LATERAL (
SELECT DISTINCT d.document_id, ceil(random() * 50000)::int
FROM generate_series (1,15)) i;
ALTER TABLE documents_tags ADD PRIMARY KEY (document_id, tag_id); -- your current index
ANALYZE documents_tags;
ANALYZE documents;
Note that rows in documents_tags
are physically clustered by document_id
due to the way I filled the table - which is likely your current situation as well.
3 test runs with each of the 4 queries, best of 5 every time to exclude caching effects.
Test 1: With documents_tags_pkey
like you have it. Index and physical order of rows are bad for our query.
Test 2: Recreate the PK on (tag_id, document_id)
like suggested.
Test 3: CLUSTER
on new PK.
Execution time of EXPLAIN ANALYZE
in ms:
time in ms | Test 1 | Test 2 | Test 3 1. NOT IN | 654 | 70 | 71 -- winner! 2. NOT EXISTS | 684 | 103 | 97 3. LEFT JOIN | 685 | 98 | 99 4. EXCEPT ALL | 833 | 255 | 250
Key element is the right index with leading tag_id
- for queries involving few tag_id
and many document_id
.
To be precise, it's not important that there are more distinct document_id
than tag_id
. This could be the other way round as well. Btree indexes basically perform the same with any order of columns. It's the fact that the most selective predicate in your query filters on tag_id
. And that's faster on the leading index column(s).
The winning query for few tag_id
to exclude is your original with NOT IN
.
NOT EXISTS
and LEFT JOIN / IS NULL
result in the same query plan. For more than a couple of dozen IDs, I expect these to scale better ...
In a read-only situation you'll see index-only scans exclusively, so the physical order of rows in the table becomes irrelevant. Hence, test 3 did not bring any more improvements.
If writes to the table happen and autovacuum can't keep up, you'll see (bitmap) index scans. Physical clustering is important for those.
Use an outer join, with the tag condition on the join, keeping only missed joins to return where none of the specified tags match:
select d.id
from documents d
join documents_tags t on t.document_id = d.id
and t.tag_id in (12345, 7654)
where t.document_id is null
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With