
Most recent record, prior to date, by category: optimising

I have a table in a PostgreSQL database called feeds_up. It looks like:

| feed_url | isup | hasproblems | observed (timestamp with tz)  | id (pk)|
|----------|------|-------------|-------------------------------|--------|
| http://b.| t    | f           | 2013-02-27 16:34:46.327401+11 | 15235  |
| http://f.| f    | t           | 2013-02-27 16:31:25.415126+11 | 15236  |

It has about 300k rows and grows by ~20 rows every five minutes. I have a query that runs very often (on every page load):

select distinct on (feed_url) feed_url, isUp, hasProblems
    from feeds_up
    where observed <= '2013-02-27T05:38:00.000Z'
    order by feed_url, observed desc;

The timestamp above is only an example; in practice it is a parameter. The EXPLAIN ANALYZE output is on explain.depesz.com. The query takes about 8 s. Crazy!

There are only about 20 unique values for feed_url, so this seems really inefficient. I thought I'd be stupid and try a FOR loop in a function:

CREATE OR REPLACE FUNCTION feedStatusAtDate(theTime timestamp with time zone) RETURNS SETOF feeds_up AS
$BODY$
DECLARE
    url feeds_list%rowtype;
BEGIN
FOR url IN SELECT * FROM feeds_list 
LOOP
    RETURN QUERY SELECT * FROM feeds_up
    WHERE observed <= theTime
    AND feed_url = url.feed_url
    ORDER BY observed DESC LIMIT 1;
END LOOP;
END;
$BODY$ language plpgsql;

select * from feedStatusAtDate('2013-02-27T05:38:00.000Z');

That takes just 307ms!

Using a FOR loop in SQL rubs me the wrong way. How can I write a nice query, like the first one, that is also efficient? Is that possible? Or is this the kind of thing where a FOR loop really is best?

Edited to add:

Postgres version: PostgreSQL 9.1.5 on i686-pc-linux-gnu, compiled by gcc (SUSE Linux) 4.3.4 [gcc-4_3-branch revision 152973], 32-bit

Indexes on feeds_up:

CREATE INDEX feeds_up_url
  ON feeds_up
  USING btree
  (feed_url COLLATE pg_catalog."default");

CREATE INDEX feeds_up_url_observed
  ON feeds_up
  USING btree
  (feed_url COLLATE pg_catalog."default", observed DESC);

CREATE INDEX feeds_up_observed
  ON public.feeds_up
  USING btree
  (observed DESC);
Cathy asked Nov 12 '22 07:11


1 Answer

Assuming that "id" is a serial column and rows are always inserted in time order, you can simplify by finding the MAX(id) for each feed_url in a subquery and then joining back to pull in the rest of the data:

SELECT fu.feed_url, fu.isup, fu.hasproblems, fu.observed
FROM feeds_up fu
JOIN
(
  SELECT feed_url, max(id)  AS id FROM feeds_up
  WHERE observed <= '2013-03-27T05:38:00.000Z'
  GROUP BY feed_url
) AS q USING (id)
ORDER BY fu.feed_url, fu.observed desc;

I did a quick test and this ran very efficiently, using only an index on "observed".
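To see why the assumption about "id" matters, here is a minimal sketch that runs the same max(id) join on a toy table. It uses Python's sqlite3 purely as a runnable stand-in for PostgreSQL; the URLs and rows are made-up sample data, not taken from the question, and it says nothing about timing on a real 300k-row table.

```python
import sqlite3

# Toy stand-in for feeds_up; SQLite is used only so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE feeds_up ("
    " id INTEGER PRIMARY KEY, feed_url TEXT, isup INTEGER,"
    " hasproblems INTEGER, observed TEXT)"
)
rows = [
    # (id, feed_url, isup, hasproblems, observed) -- hypothetical sample data
    (1, "http://a.example", 1, 0, "2013-02-27T05:00:00Z"),
    (2, "http://b.example", 0, 1, "2013-02-27T05:10:00Z"),
    (3, "http://a.example", 0, 1, "2013-02-27T05:30:00Z"),
    (4, "http://b.example", 1, 0, "2013-02-27T05:45:00Z"),  # after the cutoff
]
conn.executemany("INSERT INTO feeds_up VALUES (?, ?, ?, ?, ?)", rows)

# Same shape as the max(id) query above, with the cutoff as a parameter.
LATEST_BY_ID = """
SELECT fu.feed_url, fu.isup, fu.hasproblems, fu.observed
FROM feeds_up fu
JOIN (
  SELECT feed_url, max(id) AS id FROM feeds_up
  WHERE observed <= ?
  GROUP BY feed_url
) AS q USING (id)
ORDER BY fu.feed_url
"""


def latest_by_id(cutoff):
    """Return one row per feed_url: the row with the highest id at or before cutoff."""
    return conn.execute(LATEST_BY_ID, (cutoff,)).fetchall()


cutoff = "2013-02-27T05:38:00Z"
in_order = latest_by_id(cutoff)

# A late-arriving row: higher id, but an older observation than id 3.
conn.execute(
    "INSERT INTO feeds_up VALUES "
    "(5, 'http://a.example', 1, 0, '2013-02-27T05:20:00Z')"
)
out_of_order = latest_by_id(cutoff)

print(in_order)
print(out_of_order)
```

While ids follow observation time, the join returns the newest row per feed. Once the late row with id 5 is inserted, the result for the first feed switches to the 05:20 observation even though the 05:30 one is newer, so the shortcut is only safe if rows really are inserted in time order.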

UPDATE:

To use "observed" instead of "id" (since records may not be inserted in order), you can modify the above query as follows:

SELECT DISTINCT ON (fu.feed_url) fu.feed_url, fu.isup, fu.hasproblems, fu.observed
FROM feeds_up fu
JOIN
(
  SELECT feed_url, max(observed) as observed FROM feeds_up
  WHERE observed <= '2013-03-27T05:38:00.000Z'
  GROUP BY feed_url
) AS q USING (feed_url, observed)
ORDER BY fu.feed_url, fu.observed desc;

On my system this ran in nearly the same time, again with only the one index on "observed". YMMV.
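For comparison, the FOR loop from the question can also be written as a single statement with a correlated subquery: one "latest row at or before the cutoff" lookup per entry in feeds_list. This is a different technique from the joins above and is not something I have timed on PostgreSQL; the sketch below only checks that the query shape returns the right rows. It uses Python's sqlite3 as a stand-in, the sample URLs and rows are made up, and the single-column layout of feeds_list is an assumption (the question only shows that it has a feed_url column).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- feeds_list is assumed to hold one row per feed, as in the question's function.
CREATE TABLE feeds_list (feed_url TEXT PRIMARY KEY);
CREATE TABLE feeds_up (
  id INTEGER PRIMARY KEY, feed_url TEXT, isup INTEGER,
  hasproblems INTEGER, observed TEXT
);
-- Mirrors the feeds_up_url_observed index listed in the question.
CREATE INDEX feeds_up_url_observed ON feeds_up (feed_url, observed DESC);
""")

feeds = [("http://a.example",), ("http://b.example",), ("http://c.example",)]
conn.executemany("INSERT INTO feeds_list VALUES (?)", feeds)

rows = [
    # (id, feed_url, isup, hasproblems, observed) -- hypothetical sample data
    (1, "http://a.example", 1, 0, "2013-02-27T05:00:00Z"),
    (2, "http://b.example", 0, 1, "2013-02-27T05:10:00Z"),
    (3, "http://a.example", 0, 1, "2013-02-27T05:30:00Z"),
    (4, "http://b.example", 1, 0, "2013-02-27T05:45:00Z"),  # after the cutoff
    (5, "http://c.example", 1, 0, "2013-02-27T05:50:00Z"),  # only after the cutoff
]
conn.executemany("INSERT INTO feeds_up VALUES (?, ?, ?, ?, ?)", rows)

# For each feed in feeds_list, pick the id of its newest row at or before
# the cutoff, then join back to fetch that row.
LATEST_PER_FEED = """
SELECT fu.feed_url, fu.isup, fu.hasproblems, fu.observed
FROM feeds_list fl
JOIN feeds_up fu ON fu.id = (
  SELECT x.id FROM feeds_up x
  WHERE x.feed_url = fl.feed_url AND x.observed <= ?
  ORDER BY x.observed DESC
  LIMIT 1
)
ORDER BY fu.feed_url
"""

latest = conn.execute(LATEST_PER_FEED, ("2013-02-27T05:38:00Z",)).fetchall()
print(latest)
```

A feed with no observation at or before the cutoff (the third one here) is simply left out, which matches what the loop version does. Whether this beats the DISTINCT ON query on a given PostgreSQL installation would need to be checked with EXPLAIN ANALYZE.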

marcj answered Nov 15 '22 07:11