Most recent record, prior to date, by category: optimising

I have a table in a PostgreSQL database called feeds_up. It looks like:

| feed_url | isup | hasproblems | observed timestamp with tz    | id (pk)|
| http://b.| t    | f           | 2013-02-27 16:34:46.327401+11 | 15235  |
| http://f.| f    | t           | 2013-02-27 16:31:25.415126+11 | 15236  |

It has something like 300k rows, growing at ~20 rows every five minutes. I have a query which runs very often (on every page load):

select distinct on (feed_url) feed_url, isUp, hasProblems
    from feeds_up
    where observed <= '2013-02-27T05:38:00.000Z'
    order by feed_url, observed desc;

The time shown is just an example; in practice it is a parameter. The EXPLAIN ANALYZE output is on explain.depesz.com. The query takes about 8s. Crazy!

There are only about 20 unique values for feed_url, so this seems really inefficient. I thought I'd be stupid and try a FOR loop in a function.

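CREATE OR REPLACE FUNCTION feedStatusAtDate(theTime timestamp with time zone) RETURNS SETOF feeds_up AS
$BODY$
DECLARE
    url feeds_list%rowtype;
BEGIN
    -- For each known feed, return its latest row at or before theTime.
    FOR url IN SELECT * FROM feeds_list LOOP
        RETURN QUERY SELECT * FROM feeds_up
            WHERE observed <= theTime
            AND feed_url = url.feed_url
            ORDER BY observed DESC LIMIT 1;
    END LOOP;
END;
$BODY$ language plpgsql;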

select * from feedStatusAtDate('2013-02-27T05:38:00.000Z');

That takes just 307ms!

Using a FOR loop in SQL rubs me the wrong way. How can I make a nice query, like the first one, that is efficient? Is that possible? Or is this the kind of thing where a FOR loop really is best?


Postgres version: PostgreSQL 9.1.5 on i686-pc-linux-gnu, compiled by gcc (SUSE Linux) 4.3.4 [gcc-4_3-branch revision 152973], 32-bit

Indexes on feeds_up:

CREATE INDEX feeds_up_url
  ON feeds_up
  USING btree
  (feed_url COLLATE pg_catalog."default");

CREATE INDEX feeds_up_url_observed
  ON feeds_up
  USING btree
  (feed_url COLLATE pg_catalog."default", observed DESC);

CREATE INDEX feeds_up_observed
  ON public.feeds_up
  USING btree
  (observed DESC);
Cathy asked Nov 12 '22 07:11


1 Answer

Assuming that "id" is serial and always sequential, you might simplify by finding the MAX(id) for each feed_url in a subquery and then pull in the rest of the data as follows:

SELECT fu.feed_url, fu.isup, fu.hasproblems, fu.observed
FROM feeds_up fu
JOIN (
  SELECT feed_url, max(id) AS id FROM feeds_up
  WHERE observed <= '2013-03-27T05:38:00.000Z'
  GROUP BY feed_url
) AS q USING (id)
ORDER BY fu.feed_url, fu.observed desc;

I did a quick test and this works very efficiently utilizing only an index on "observed".
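
Whether the max(id) shortcut is safe depends on ids being assigned in the same order as "observed" within each feed. A minimal sketch of one way to check that, assuming only the table and columns shown in the question (the names out_of_order, prev_observed and s are made up for illustration):

```sql
-- Hypothetical sanity check: count rows whose "observed" goes backwards
-- when each feed's rows are read in id order.
-- A result of 0 suggests max(id) and max(observed) select an equally
-- recent row for every feed_url.
SELECT count(*) AS out_of_order
FROM (
    SELECT observed,
           lag(observed) OVER (PARTITION BY feed_url ORDER BY id) AS prev_observed
    FROM feeds_up
) AS s
WHERE observed < prev_observed;
```

If this returns anything other than 0, the max(id) form can pick a row that is not the latest observation for that feed.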


To use "observed" instead of "id" (since records may not be inserted in order), you can modify the above query as follows:

SELECT DISTINCT ON (fu.feed_url) fu.feed_url, fu.isup, fu.hasproblems, fu.observed
FROM feeds_up fu
JOIN (
  SELECT feed_url, max(observed) AS observed FROM feeds_up
  WHERE observed <= '2013-03-27T05:38:00.000Z'
  GROUP BY feed_url
) AS q USING (feed_url, observed)
ORDER BY fu.feed_url, fu.observed desc;

On my system this ran in nearly the same time with the one index on "observed". YMMV
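
For comparison, the per-feed lookup from the question's FOR loop can also be written as a single statement. This is only a sketch, not something tested against the asker's data: it assumes feeds_list (the table the question's function loops over) has one row per feed with a feed_url column, and the aliases fl, f and q are made up for illustration. Whether the planner actually uses the (feed_url, observed DESC) index for each lookup is something to confirm with EXPLAIN ANALYZE.

```sql
-- For every feed in feeds_list, find the id of its latest row at or
-- before the cutoff, then join back to feeds_up for the other columns.
-- Feeds with no row before the cutoff yield a NULL id and drop out of
-- the join, matching the behaviour of the FOR loop version.
SELECT fu.feed_url, fu.isup, fu.hasproblems, fu.observed
FROM (
    SELECT (SELECT f.id
            FROM feeds_up f
            WHERE f.feed_url = fl.feed_url
              AND f.observed <= '2013-02-27T05:38:00.000Z'
            ORDER BY f.observed DESC
            LIMIT 1) AS id
    FROM feeds_list fl
) AS q
JOIN feeds_up fu USING (id)
ORDER BY fu.feed_url;
```

The correlated subquery uses only features available in 9.1, so it does not depend on newer constructs such as LATERAL.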

marcj answered Nov 15 '22 07:11
