I am designing a mostly read-only database containing 300,000 documents with around 50,000 distinct tags, with each document having 15 tags on average. For now, the only query I care about is selecting all documents that have no tag from a given set of tags. I'm only interested in the <code>document_id</code> column (no other columns in the result). My schema is essentially: <pre class="prettyprint"><code>CREATE TABLE documents (
    document_id SERIAL PRIMARY KEY,
    title TEXT
);

CREATE TABLE tags (
    tag_id SERIAL PRIMARY KEY,
    name TEXT UNIQUE
);

CREATE TABLE documents_tags (
    document_id INTEGER REFERENCES documents,
    tag_id INTEGER REFERENCES tags,
    PRIMARY KEY (document_id, tag_id)
);
</code></pre> I can write this query in Python by pre-computing the set of documents for each tag, which reduces the problem to a few fast set operations: <blockquote> <pre class="prettyprint"><code>In [17]: %timeit all_docs - (tags_to_docs[12345] | tags_to_docs[7654])
100 loops, best of 3: 13.7 ms per loop
</code></pre> </blockquote> Translating the set operations to Postgres is not nearly as fast, however: <pre class="prettyprint"><code>stuff=# SELECT document_id AS id FROM documents WHERE document_id NOT IN (
stuff(#     SELECT documents_tags.document_id AS id FROM documents_tags
stuff(#     WHERE documents_tags.tag_id IN (12345, 7654)
stuff(# );
document_id
---------------
    ...
Time: 201.476 ms
</code></pre> <ul> <li>Replacing <code>NOT IN</code> with <code>EXCEPT</code> makes it even slower.</li> <li>I have btree indexes on <code>document_id</code> and <code>tag_id</code> in all three tables and another one on <code>(document_id, tag_id)</code>.</li> <li>The default memory limits for the Postgres process have been increased significantly, so I don't think Postgres is misconfigured.</li> </ul> How do I speed up this query? Is there any way to pre-compute the mapping between tags and documents like I did with Python, or am I thinking about this the wrong way?
<hr> Here is the result of an <code>EXPLAIN ANALYZE</code>: <pre class="prettyprint"><code>EXPLAIN ANALYZE
SELECT document_id AS id FROM documents WHERE document_id NOT IN (
    SELECT documents_tags.document_id AS id FROM documents_tags
    WHERE documents_tags.tag_id IN (12345, 7654)
);
                                   QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on documents (cost=20280.27..38267.57 rows=83212 width=4) (actual time=176.760..300.214 rows=20036 loops=1)
   Filter: (NOT (hashed SubPlan 1))
   Rows Removed by Filter: 146388
   SubPlan 1
     -> Bitmap Heap Scan on documents_tags (cost=5344.61..19661.00 rows=247711 width=4) (actual time=32.964..89.514 rows=235093 loops=1)
          Recheck Cond: (tag_id = ANY ('{12345,7654}'::integer[]))
          Heap Blocks: exact=3300
          -> Bitmap Index Scan on documents_tags__tag_id_index (cost=0.00..5282.68 rows=247711 width=0) (actual time=32.320..32.320 rows=243230 loops=1)
               Index Cond: (tag_id = ANY ('{12345,7654}'::integer[]))
 Planning time: 0.117 ms
 Execution time: 303.289 ms
(11 rows)

Time: 303.790 ms
</code></pre> The only settings I changed from the default configuration were: <pre class="prettyprint"><code>shared_buffers = 5GB
temp_buffers = 128MB
work_mem = 512MB
effective_cache_size = 16GB
</code></pre> Running Postgres 9.4.5 on a server with 64GB RAM.
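For reference, the Python pre-computation mentioned in the question can be sketched roughly like this. It is only an illustration: the `pairs` list is made-up sample data standing in for the rows of `documents_tags`, and `docs_without_tags` is a hypothetical helper name, not code from the actual project.

```python
from collections import defaultdict

# Hypothetical (document_id, tag_id) rows standing in for documents_tags.
pairs = [(1, 12345), (1, 9), (2, 7654), (3, 9), (4, 12345), (4, 7654), (5, 42)]

# Pre-compute the mapping tag_id -> set of document_ids once, up front.
tags_to_docs = defaultdict(set)
for document_id, tag_id in pairs:
    tags_to_docs[tag_id].add(document_id)

# Assumption: every document has at least one tag, so the pairs cover all
# documents. In a real setup this set would come from the documents table.
all_docs = {document_id for document_id, _ in pairs}


def docs_without_tags(excluded_tags):
    """Return the documents that carry none of the given tags."""
    excluded = set()
    for tag_id in excluded_tags:
        # .get avoids creating empty entries for unknown tags.
        excluded |= tags_to_docs.get(tag_id, set())
    return all_docs - excluded


result = docs_without_tags([12345, 7654])
print(sorted(result))
```

The per-query work is then one set union per excluded tag plus a single set difference, which is why the timing above is dominated by the size of the sets rather than by any lookup cost.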

Use an outer join with the tag condition in the join clause, then keep only the rows where the join found no match. Those are the documents that have none of the specified tags: <pre class="prettyprint"><code>select d.document_id
from documents d
left join documents_tags t
       on t.document_id = d.document_id
      and t.tag_id in (12345, 7654)
where t.document_id is null
</code></pre>
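To sanity-check that the outer-join form returns the same rows as the original `NOT IN` query, here is a small sketch. It uses Python's built-in `sqlite3` with an in-memory database purely as a stand-in for Postgres, and the sample rows are invented, so it says nothing about timings or query plans on the real data.

```python
import sqlite3

# In-memory SQLite database as a stand-in for Postgres (assumption: the two
# queries below behave the same on both for this simple schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE documents (document_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE documents_tags (
    document_id INTEGER REFERENCES documents,
    tag_id INTEGER,
    PRIMARY KEY (document_id, tag_id)
);
-- Invented sample data: document 4 has no tags at all.
INSERT INTO documents (document_id, title)
VALUES (1, 'a'), (2, 'b'), (3, 'c'), (4, 'd');
INSERT INTO documents_tags (document_id, tag_id)
VALUES (1, 12345), (1, 9), (2, 7654), (3, 9);
""")

# Anti-join: outer join on the excluded tags, keep rows with no match.
anti_join = """
SELECT d.document_id
FROM documents d
LEFT JOIN documents_tags t
       ON t.document_id = d.document_id
      AND t.tag_id IN (12345, 7654)
WHERE t.document_id IS NULL
ORDER BY d.document_id
"""

# The NOT IN form from the question, for comparison.
not_in = """
SELECT document_id
FROM documents
WHERE document_id NOT IN (
    SELECT document_id FROM documents_tags WHERE tag_id IN (12345, 7654)
)
ORDER BY document_id
"""

anti_join_ids = [row[0] for row in conn.execute(anti_join)]
not_in_ids = [row[0] for row in conn.execute(not_in)]
print(anti_join_ids, not_in_ids)
```

Note that the tag filter has to stay in the `ON` clause. Moving it into `WHERE` would discard the unmatched rows and turn the query back into an ordinary inner join.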

Optimizing a row exclusion query

Tags:

performance

sql

indexing

postgresql

postgresql-performance


user2472188


2 Answers


Erwin Brandstetter

Bohemian
