I have a table in a PostgreSQL database called feeds_up. It looks like:
| feed_url | isup | hasproblems | observed timestamp with tz | id (pk)|
|----------|------|-------------|-------------------------------|--------|
| http://b.| t | f | 2013-02-27 16:34:46.327401+11 | 15235 |
| http://f.| f | t | 2013-02-27 16:31:25.415126+11 | 15236 |
It has something like 300k rows, growing at ~20 rows every five minutes. I have a query which runs very often (on every page load):
select distinct on (feed_url) feed_url, isUp, hasProblems
from feeds_up
where observed <= '2013-02-27T05:38:00.000Z'
order by feed_url, observed desc;
I put an example time there; the actual time is parametrized. The EXPLAIN ANALYSE output is on explain.depesz.com. It takes about 8s. Crazy!
There are only about 20 unique values for feed_url, so this seems really inefficient. I thought I'd be stupid and try a FOR loop in a function.
CREATE OR REPLACE FUNCTION feedStatusAtDate(theTime timestamp with time zone) RETURNS SETOF feeds_up AS
$BODY$
DECLARE
url feeds_list%rowtype;
BEGIN
FOR url IN SELECT * FROM feeds_list
LOOP
RETURN QUERY SELECT * FROM feeds_up
WHERE observed <= theTime
AND feed_url = url.feed_url
ORDER BY observed DESC LIMIT 1;
END LOOP;
END;
$BODY$ language plpgsql;
select * from feedStatusAtDate('2013-02-27T05:38:00.000Z');
That takes just 307ms!
Using a FOR loop in SQL rubs me the wrong way. How can I make a nice query, like the first one, that is efficient? Is that possible? Or is this the kind of thing where a FOR loop really is best?
ETA
Postgres version: PostgreSQL 9.1.5 on i686-pc-linux-gnu, compiled by gcc (SUSE Linux) 4.3.4 [gcc-4_3-branch revision 152973], 32-bit
Indexes on feeds_up:
CREATE INDEX feeds_up_url
ON feeds_up
USING btree
(feed_url COLLATE pg_catalog."default");
CREATE INDEX feeds_up_url_observed
ON feeds_up
USING btree
(feed_url COLLATE pg_catalog."default", observed DESC);
CREATE INDEX feeds_up_observed
ON public.feeds_up
USING btree
(observed DESC);
Assuming that "id" is serial and always sequential, you might simplify by finding the MAX(id) for each feed_url in a subquery and then pulling in the rest of the data as follows:
SELECT fu.feed_url, fu.isup, fu.hasproblems, fu.observed
FROM feeds_up fu
JOIN
(
SELECT feed_url, max(id) AS id FROM feeds_up
WHERE observed <= '2013-03-27T05:38:00.000Z'
GROUP BY feed_url
) AS q USING (id)
ORDER BY fu.feed_url, fu.observed desc;
I did a quick test and this works very efficiently, using only an index on "observed".
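The MAX(id) approach is only correct if rows with a higher "id" never carry an earlier "observed" value. A rough way to check that assumption against the existing data is sketched below; it counts rows whose timestamp is earlier than that of the row with the previous id. The alias names (s, prev_observed) are made up for illustration, and the query scans the whole table, so treat it as a one-off sanity check rather than something to run per page load.

```sql
-- Count rows where "observed" goes backwards relative to id order.
-- A result of 0 suggests id order and observed order agree in the
-- current data; it says nothing about rows inserted later.
SELECT count(*) AS out_of_order_rows
FROM (
    SELECT id,
           observed,
           lag(observed) OVER (ORDER BY id) AS prev_observed
    FROM feeds_up
) AS s
WHERE s.observed < s.prev_observed;
```

If this returns anything other than 0, the variant that uses max(observed) is the safer choice.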
UPDATE:
To use "observed" instead of "id" (since records may not be inserted in order), you can modify the above query as follows:
SELECT DISTINCT ON (fu.feed_url) fu.feed_url, fu.isup, fu.hasproblems, fu.observed
FROM feeds_up fu
JOIN
(
SELECT feed_url, max(observed) as observed FROM feeds_up
WHERE observed <= '2013-03-27T05:38:00.000Z'
GROUP BY feed_url
) AS q USING (feed_url, observed)
ORDER BY fu.feed_url, fu.observed desc;
On my system this ran in nearly the same time with the one index on "observed". YMMV
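For comparison, the FOR loop from the question can probably be expressed as a single statement with a correlated subquery, which should work on 9.1 (LATERAL joins only arrived in 9.3). This is a sketch, not something I have timed on your data. It assumes, as the function in the question does, that feeds_list has one row per feed with a feed_url column, and that "id" is unique in feeds_up. The aliases fl, fu and f are arbitrary.

```sql
-- For each feed in feeds_list, pick the id of its most recent
-- feeds_up row at or before the given time, then join to that row.
-- Each subquery lookup can be served by the existing
-- (feed_url, observed DESC) index, much like one loop iteration.
SELECT fu.feed_url, fu.isup, fu.hasproblems, fu.observed
FROM feeds_list fl
JOIN feeds_up fu
  ON fu.id = (
        SELECT f.id
        FROM feeds_up f
        WHERE f.feed_url = fl.feed_url
          AND f.observed <= '2013-02-27T05:38:00.000Z'
        ORDER BY f.observed DESC
        LIMIT 1
     )
ORDER BY fu.feed_url;
```

Feeds with no rows before the given time simply drop out of the result, which matches what the loop version returns. Check the plan with EXPLAIN ANALYSE before relying on it.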