
Postgres - Slow simple join with where-clause

I am having some trouble optimizing a query, and was hoping that someone here might be able to provide a few pointers.

I have two tables:

CREATE TABLE "blog_cached_posts" (
    "id" int4 NOT NULL DEFAULT nextval('blog_cached_posts_id_seq'::regclass),
    "title" varchar(255),
    "content" text,
    "content_encoded" text,
    "published_at" timestamp(6) NULL,
    "written_by" varchar(255),
    "link" varchar(255),
    "blog_id" int4,
    "guid" varchar(255),
    "created_at" timestamp(6) NULL,
    "updated_at" timestamp(6) NULL,
    "is_highlighted_post" bool DEFAULT false
)

With an index on blog_cached_posts.blog_id

CREATE TABLE "blogs" (
    "id" int4 NOT NULL DEFAULT nextval('blogs_id_seq'::regclass),
    "site_id" int4,
    "image_id" int4,
    "name" varchar(255),
    "description" text,
    "url" varchar(255),
    "rss_feed_url" varchar(255),
    "active" bool DEFAULT true,
    "created_at" timestamp(6) NULL,
    "updated_at" timestamp(6) NULL,
    "date_highlighted" date,
    "highlighted_category_feed_url" varchar(255),
    "position" int4
)

With an index on blogs.site_id
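
For reference, the two indexes mentioned above were created along these lines (the exact index names may differ):

CREATE INDEX blog_cached_posts_blog_id ON blog_cached_posts (blog_id);
CREATE INDEX blogs_site_id ON blogs (site_id);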

This is the query:

SELECT "blog_cached_posts".*
FROM "blog_cached_posts"
join blogs on blogs.id = blog_cached_posts.blog_id
WHERE ((published_at IS NOT NULL) AND (blogs.site_id = 80))
ORDER BY published_at desc
LIMIT 5

Here is an EXPLAIN ANALYZE:

Limit  (cost=9499.16..9499.17 rows=5 width=1853) (actual time=118.538..118.539 rows=5 loops=1)
  ->  Sort  (cost=9499.16..9626.31 rows=50861 width=1853) (actual time=118.536..118.537 rows=5 loops=1)
        Sort Key: blog_cached_posts.published_at
        Sort Method:  top-N heapsort  Memory: 33kB
        ->  Hash Join  (cost=16.25..8654.38 rows=50861 width=1853) (actual time=0.186..82.910 rows=48462 loops=1)
              Hash Cond: (blog_cached_posts.blog_id = blogs.id)
              ->  Seq Scan on blog_cached_posts  (cost=0.00..7930.94 rows=52954 width=1853) (actual time=0.042..56.635 rows=52950 loops=1)
                    Filter: (published_at IS NOT NULL)
              ->  Hash  (cost=13.21..13.21 rows=243 width=4) (actual time=0.135..0.135 rows=243 loops=1)
                    ->  Seq Scan on blogs  (cost=0.00..13.21 rows=243 width=4) (actual time=0.007..0.089 rows=243 loops=1)
                          Filter: (site_id = 80)
Total runtime: 118.591 ms

Is there any way to optimize this beyond the ~120ms it is currently taking?

EDIT

Here is what I ended up doing, after reading the comment by @ypercube.

I added an index to blog_cached_posts:

CREATE INDEX \"blog_cached_posts_published_at\" ON \"public\".\"blog_cached_posts\" USING btree(published_at DESC NULLS LAST);

And I changed the select to the following:

SELECT "blog_cached_posts".*
FROM "blog_cached_posts"
join blogs on blogs.id = blog_cached_posts.blog_id
WHERE published_at is not null and blogs.site_id = 80
ORDER BY published_at desc nulls last
LIMIT 5

This brought the execution time down to ~3ms.

Here is the new execution plan:

Limit  (cost=0.00..3.85 rows=5 width=1849) (actual time=0.027..0.047 rows=5 loops=1)
  ->  Nested Loop  (cost=0.00..39190.01 rows=50872 width=1849) (actual time=0.026..0.046 rows=5 loops=1)
        ->  Index Scan using blog_cached_posts_published_at on blog_cached_posts  (cost=0.00..24175.16 rows=52965 width=1849) (actual time=0.017..0.023 rows=5 loops=1)
              Filter: (published_at IS NOT NULL)
        ->  Index Scan using blogs_pkey on blogs  (cost=0.00..0.27 rows=1 width=4) (actual time=0.003..0.004 rows=1 loops=5)
              Index Cond: (blogs.id = blog_cached_posts.blog_id)
              Filter: (blogs.site_id = 80)
Total runtime: 0.086 ms
asked Jun 17 '11 by Thomas Dippel

2 Answers

Your issue is that you cannot realistically use an index to pull the needed five posts outright.

An index on (blog_id, published_at) (suggested in comments) would arguably help if you were querying for a specific blog, but your most selective constraint in that query is the one on site_id, which lives in a separate table altogether.

Seq Scan on blogs  (cost=0.00..13.21 rows=243 width=4) (actual time=0.007..0.089 rows=243 loops=1)
  Filter: (site_id = 80)

The above means that either you have no index on site_id, or that this particular site_id is all over the place, so Postgres goes through the whole table since it would need to read most of it regardless.

This then leads to multiple blog ids, and these are used to retrieve all of the relevant posts using a hash join. But since multiple blogs are involved, the best PG can do is to grab all applicable posts and subsequently top-n sort them.

Even if you were to change it so that you pass the blog ids directly in an IN() clause, an index on (blog_id, published_at) would not yield the needed rows in order. So it would still grab all posts for all of the applicable blogs, and top-n sort the mess.
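
For illustration, such a rewrite might look like the following sketch, where the literal ids are hypothetical stand-ins for whatever blog ids belong to site 80:

SELECT "blog_cached_posts".*
FROM "blog_cached_posts"
WHERE blog_id IN (101, 102, 103) -- hypothetical blog ids for site_id = 80
  AND published_at IS NOT NULL
ORDER BY published_at DESC
LIMIT 5;

Entries in a (blog_id, published_at) index are sorted by blog_id first, so the five newest posts across several blogs are scattered throughout the index, and the top-n sort remains.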

One way to work around the problem would be to change your schema slightly:

CREATE TABLE "blog_cached_posts" (
    "id" int4 NOT NULL DEFAULT nextval('blog_cached_posts_id_seq'::regclass),
    ...
    "published_at" timestamp(6) NULL,
    "site_id" int4,
    "blog_id" int4,
    ...
)

Note the extra site_id column. This allows you to subsequently create an index on (site_id, published_at desc nulls last) and rewrite your query like:

SELECT "blog_cached_posts".*
FROM "blog_cached_posts"
WHERE site_id = 80
ORDER BY published_at desc nulls last
LIMIT 5
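
The matching index would be along these lines (a sketch; the index name is an assumption):

CREATE INDEX blog_cached_posts_site_id_published_at
    ON blog_cached_posts (site_id, published_at DESC NULLS LAST);

With that index, the planner can walk the index entries for site_id = 80 in published_at order and stop after the first five rows, with no sort step at all.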

An alternative approach, per comments, is to maintain a latest_site_posts table using triggers. You'll end up with something similar to the above suggestion, but with a smaller table (i.e. less duplication of the site_id).
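
A minimal sketch of that trigger approach follows; the table layout, function name, and trigger name are all assumptions, and pruning the table down to only the latest posts per site is left out:

CREATE TABLE latest_site_posts (
    site_id int4 NOT NULL,
    post_id int4 NOT NULL,
    published_at timestamp(6),
    PRIMARY KEY (site_id, post_id)
);

-- Copy each newly cached post into latest_site_posts, resolving the
-- site_id through the blogs table.
CREATE FUNCTION copy_to_latest_site_posts() RETURNS trigger AS $$
BEGIN
    INSERT INTO latest_site_posts (site_id, post_id, published_at)
    SELECT blogs.site_id, NEW.id, NEW.published_at
    FROM blogs
    WHERE blogs.id = NEW.blog_id
      AND NEW.published_at IS NOT NULL;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER blog_cached_posts_latest
    AFTER INSERT ON blog_cached_posts
    FOR EACH ROW EXECUTE PROCEDURE copy_to_latest_site_posts();

An index on latest_site_posts (site_id, published_at DESC NULLS LAST) then serves the top-5 query the same way as above, against a much smaller table.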

— Denis de Bernardy

As I mentioned in a comment, I would first try adding a simple index on published_at. It seems that without the ORDER BY and LIMIT 5 clauses the query would be quite efficient, and all the other needed indexes already existed.

Therefore, adding an index on the field used for the final sort is often quite effective.

As Dems explained in his answer:

Because the index (blog_id, published_at) is ordered in a way that is good for the join, it turns out to be less good for the sort. On those grounds, you might see value in two indexes instead of one (on blog_id and published_at separately).

— ypercubeᵀᴹ