Here's my situation. I have a table with a bunch of URLs and crawl dates associated with them. When my program processes a URL, I want to INSERT a new row with a crawl date. If the URL already exists, I want to update the crawl date to the current datetime. With MS SQL or Oracle I'd probably use a MERGE command for this. With mySQL I'd probably use the ON DUPLICATE KEY UPDATE syntax.
I could do multiple queries in my program, which may or may not be thread safe. I could write a SQL function which has various IF...ELSE logic. However, for the sake of trying out Postgres features I've never used before, I'm thinking about creating an INSERT rule - something like this:
CREATE RULE Pages_Upsert AS ON INSERT TO Pages
WHERE EXISTS (SELECT 1 from Pages P where NEW.Url = P.Url)
DO INSTEAD
UPDATE Pages SET LastCrawled = NOW(), Html = NEW.Html WHERE Url = NEW.Url;
This seems to actually work great. It probably loses some points on the "code readability" standpoint, as someone looking at my code for the first time would have to magically know about this rule, but I guess that could be solved with good code commenting and documentation.
Are there any other drawbacks to this idea, or maybe a "your idea sucks, you should do it /this/ way instead" comment? I'm on PG 9.0 if that matters.
UPDATE: Query plan since someone wanted it :)
"Insert (cost=2.79..2.81 rows=1 width=0)"
" InitPlan 1 (returns $0)"
" -> Seq Scan on pages p (cost=0.00..2.79 rows=1 width=0)"
" Filter: ('http://www.foo.com'::text = lower((url)::text))"
" -> Result (cost=0.00..0.01 rows=1 width=0)"
" One-Time Filter: ($0 IS NOT TRUE)"
""
"Update (cost=2.79..5.46 rows=1 width=111)"
" InitPlan 1 (returns $0)"
" -> Seq Scan on pages p (cost=0.00..2.79 rows=1 width=0)"
" Filter: ('http://www.foo.com'::text = lower((url)::text))"
" -> Result (cost=0.00..2.67 rows=1 width=111)"
" One-Time Filter: $0"
" -> Seq Scan on pages (cost=0.00..2.66 rows=1 width=111)"
" Filter: ((url)::text = 'http://www.foo.com'::text)"
Ok, I managed to create a testcase. The result is that the update part is always executed, even on a fresh insert. COPY seems to bypass the rule system. [For clarity I have put this into a separate reply]
DROP TABLE pages CASCADE;
CREATE TABLE pages
( url VARCHAR NOT NULL PRIMARY KEY
, html VARCHAR
, last TIMESTAMP
);
INSERT INTO pages(url,html,last) VALUES ('www.example.com://page1' , 'meuk1' , '2001-09-18 23:30:00'::timestamp );
CREATE RULE Pages_Upsert AS ON INSERT TO pages
WHERE EXISTS (SELECT 1 from pages P where NEW.url = P.url)
DO INSTEAD (
UPDATE pages SET html=new.html , last = NOW() WHERE url = NEW.url
);
INSERT INTO pages(url,html,last) VALUES ('www.example.com://page2' , 'meuk2' , '2002-09-18 23:30:00':: timestamp );
INSERT INTO pages(url,html,last) VALUES ('www.example.com://page3' , 'meuk3' , '2003-09-18 23:30:00':: timestamp );
INSERT INTO pages(url,html,last) SELECT pp.url || '/added'::text, pp.html || '.html'::text , pp.last + interval '20 years' FROM pages pp;
COPY pages(url,html,last) FROM STDIN;
www.example.com://pageX stdin 2000-09-18 23:30:00
\.
SELECT * FROM pages;
The result:
url | html | last
-------------------------------+------------+----------------------------
www.example.com://page1 | meuk1 | 2001-09-18 23:30:00
www.example.com://page2 | meuk2 | 2011-09-18 23:48:30.775373
www.example.com://page3 | meuk3 | 2011-09-18 23:48:30.783758
www.example.com://page1/added | meuk1.html | 2011-09-18 23:48:30.792097
www.example.com://page2/added | meuk2.html | 2011-09-18 23:48:30.792097
www.example.com://page3/added | meuk3.html | 2011-09-18 23:48:30.792097
www.example.com://pageX | stdin | 2000-09-18 23:30:00
(7 rows)
UPDATE: Just to prove it can be done:
INSERT INTO pages(url,html,last) VALUES ('www.example.com://page1' , 'meuk1' , '2001-09-18 23:30:00'::timestamp );
CREATE VIEW vpages AS (SELECT * from pages);
CREATE RULE Pages_Upsert AS ON INSERT TO vpages
DO INSTEAD (
UPDATE pages p0
SET html=NEW.html , last = NOW() WHERE p0.url = NEW.url
;
INSERT INTO pages (url,html,last)
SELECT NEW.url, NEW.html, NEW.last
WHERE NOT EXISTS ( SELECT * FROM pages p1 WHERE p1.url = NEW.url)
);
CREATE RULE Pages_Indate AS ON UPDATE TO vpages
DO INSTEAD (
INSERT INTO pages (url,html,last)
SELECT NEW.url, NEW.html, NEW.last
WHERE NOT EXISTS ( SELECT * FROM pages p1 WHERE p1.url = OLD.url)
;
UPDATE pages p0
SET html=NEW.html , last = NEW.last WHERE p0.url = NEW.url
;
);
INSERT INTO vpages(url,html,last) VALUES ('www.example.com://page2' , 'meuk2' , '2002-09-18 23:30:00':: timestamp );
INSERT INTO vpages(url,html,last) VALUES ('www.example.com://page3' , 'meuk3' , '2003-09-18 23:30:00':: timestamp );
INSERT INTO vpages(url,html,last) SELECT pp.url || '/added'::text, pp.html || '.html'::text , pp.last + interval '20 years' FROM vpages pp;
UPDATE vpages SET last = last + interval '-10 years' WHERE url = 'www.example.com://page1' ;
-- Copy does NOT work on views
-- COPY vpages(url,html,last) FROM STDIN;
-- www.example.com://pageX stdin 2000-09-18 23:30:00
-- \.
SELECT * FROM vpages;
Result:
INSERT 0 1
INSERT 0 1
INSERT 0 3
UPDATE 1
url | html | last
-------------------------------+------------+---------------------
www.example.com://page2 | meuk2 | 2002-09-18 23:30:00
www.example.com://page3 | meuk3 | 2003-09-18 23:30:00
www.example.com://page1/added | meuk1.html | 2021-09-18 23:30:00
www.example.com://page2/added | meuk2.html | 2022-09-18 23:30:00
www.example.com://page3/added | meuk3.html | 2023-09-18 23:30:00
www.example.com://page1 | meuk1 | 1991-09-18 23:30:00
(6 rows)
The view is necessary to prevent the rewrite system to go into recursion. Construction of a DELETE rule is left as an exercise to the reader.
Some good points from someone who should know it or be very near to someone like that ;-)
What are PostgreSQL RULEs good for?
Short story:
SERIAL
and BIGSERIAL
?RETURNING
clauses of INSERT
and UPDATE
?random()
?All these things boils down to the fact, that the rule system is not row driven but transforms your statements in a way you never imagine.
Do yourself and your team mates a favour and stop using roles for things like that.
Edit: Your problem is well discussed in the PostgreSQL community. Search keywords are: MERGE
, UPSERT
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With