I am implementing a simple web-based RSS reader using Python (not really relevant) and PostgreSQL (9.2, if relevant). The database schema is as follows (based on the RSS format):
CREATE TABLE feed_channel
(
    id    SERIAL PRIMARY KEY,
    name  TEXT,
    link  TEXT NOT NULL,
    title TEXT
);

CREATE TABLE feed_content
(
    id          SERIAL PRIMARY KEY,
    channel     INTEGER REFERENCES feed_channel(id) ON DELETE CASCADE ON UPDATE CASCADE,
    guid        TEXT UNIQUE NOT NULL,
    title       TEXT,
    link        TEXT,
    description TEXT,
    pubdate     TIMESTAMP
);
When I create a new channel (and also when I query for updated feed info) I request the feed, insert its data into the feed_channel table, select the newly inserted ID (or the existing one, to avoid duplicates), and then add the feed data to the feed_content table. A typical scenario would be:
1. Request the feed.
2. Insert the channel data into feed_channel and get its ID, or look up the existing ID if the channel is already there.
3. Insert the feed items into feed_content, referencing that channel ID.
This is a standard "insert if not already exists, but return relevant ID" problem. To solve this I have implemented the following stored procedure:
CREATE OR REPLACE FUNCTION channel_insert(
    p_link  feed_channel.link%TYPE,
    p_title feed_channel.title%TYPE
) RETURNS feed_channel.id%TYPE AS $$
DECLARE
    v_id feed_channel.id%TYPE;
BEGIN
    SELECT id
      INTO v_id
      FROM feed_channel
     WHERE link = p_link AND title = p_title
     LIMIT 1;

    IF v_id IS NULL THEN
        INSERT INTO feed_channel(name, link, title)
        VALUES (DEFAULT, p_link, p_title)
        RETURNING id INTO v_id;
    END IF;

    RETURN v_id;
END;
$$ LANGUAGE plpgsql;
This is then called as "select channel_insert(link, title);" from my application, to insert if the channel does not already exist and return the ID of the relevant row regardless of whether it was inserted or just found (step 2 in the list above).
This works great!
However, I recently started wondering what would happen if this procedure were executed twice at the same time with the same arguments. Let's assume the following:
1. Session A calls channel_insert(link, title); its SELECT finds no existing row.
2. Session B calls channel_insert(link, title) with the same values before A commits; its SELECT also finds no row.
3. Both sessions INSERT, and the table ends up with two rows for the same channel.
Will this be a potential race condition in PostgreSQL? What is the best way to solve this problem and avoid such scenarios? Is it possible to make the entire stored procedure atomic, i.e. so that only one execution can run at a time?
One option I tried was to make the fields UNIQUE and then attempt the insert first and, if an exception is raised, select the existing row instead. This worked; however, the SERIAL field would increment for each attempt, leaving a lot of gaps in the sequence. I don't know if that would be a problem in the long run (probably not), but it is kind of annoying. Perhaps this is the preferred solution?
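Roughly what that attempt looked like (this assumes a UNIQUE constraint on (link, title); the function name is just for illustration):

CREATE OR REPLACE FUNCTION channel_insert_try(
    p_link  feed_channel.link%TYPE,
    p_title feed_channel.title%TYPE
) RETURNS feed_channel.id%TYPE AS $$
DECLARE
    v_id feed_channel.id%TYPE;
BEGIN
    BEGIN
        INSERT INTO feed_channel(link, title)
        VALUES (p_link, p_title)
        RETURNING id INTO v_id;
    EXCEPTION WHEN unique_violation THEN
        -- the row already exists (possibly inserted by a concurrent session);
        -- note that the failed INSERT has still consumed a sequence value,
        -- which is where the gaps come from
        SELECT id
          INTO v_id
          FROM feed_channel
         WHERE link = p_link AND title = p_title;
    END;
    RETURN v_id;
END;
$$ LANGUAGE plpgsql;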
Thanks for any feedback. This level of PostgreSQL magic is beyond me, so any pointers would be appreciated.
Will this be a potential race condition in PostgreSQL?
Yes, and in fact it would be in any database engine.
What is the best way to solve this problem to avoid such scenarios?
This is a loaded question and would require intimate knowledge of how the database is used by multiple users. However, I'm going to give you some options. In short, the only option you have is to LOCK the table during this process, but how you lock that table will depend on how the database is used throughout the day.
Let's start with the basic LOCK:
LOCK TABLE feed_channel
That will lock the table using the ACCESS EXCLUSIVE lock mode.
Conflicts with locks of all modes (ACCESS SHARE, ROW SHARE, ROW EXCLUSIVE, SHARE UPDATE EXCLUSIVE, SHARE, SHARE ROW EXCLUSIVE, EXCLUSIVE, and ACCESS EXCLUSIVE). This mode guarantees that the holder is the only transaction accessing the table in any way.
Now, that is the most restrictive lock available, and it would certainly resolve the race condition, but it may not be exactly what you want. That's something you're going to have to decide. So, though it's clear you're going to have to LOCK the table, it's not clear how.
What are you left to decide?
- How do you want to LOCK the table? Study the locking options at that link to make your determination.
- Where do you want to LOCK the table? Or in other words, do you want to LOCK at the top of the function (which I think you do, based on the possible race condition), or do you simply want to LOCK right before the INSERT?
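For what it's worth, here is a minimal sketch of the lock-at-the-top-of-the-function variant; the function name channel_insert_locked and the choice of SHARE ROW EXCLUSIVE mode are my own illustrations, not the only possibilities:

CREATE OR REPLACE FUNCTION channel_insert_locked(
    p_link  feed_channel.link%TYPE,
    p_title feed_channel.title%TYPE
) RETURNS feed_channel.id%TYPE AS $$
DECLARE
    v_id feed_channel.id%TYPE;
BEGIN
    -- Taken at the top of the function and held until the surrounding
    -- transaction ends. SHARE ROW EXCLUSIVE conflicts with itself, so a
    -- second concurrent call queues here; a plain LOCK TABLE
    -- (ACCESS EXCLUSIVE, as above) is the stricter alternative.
    LOCK TABLE feed_channel IN SHARE ROW EXCLUSIVE MODE;

    SELECT id
      INTO v_id
      FROM feed_channel
     WHERE link = p_link AND title = p_title
     LIMIT 1;

    IF v_id IS NULL THEN
        INSERT INTO feed_channel(link, title)
        VALUES (p_link, p_title)
        RETURNING id INTO v_id;
    END IF;

    RETURN v_id;
END;
$$ LANGUAGE plpgsql;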
Is it possible to make the entire stored procedure atomic, i.e. so that only one execution can run at a time?
No, the code can be executed by anybody that's connected to the database.
I hope that this has helped direct you.
There is an unavoidable "race" here, since two sessions cannot "see" each other's uncommitted rows. On a conflict, a session can only roll back (maybe to a savepoint) and retry. That would typically mean: referring to the other session's freshly inserted row instead of creating a private duplicate.
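In plpgsql an EXCEPTION clause does exactly that kind of rollback-to-savepoint (the protected block runs inside an implicit subtransaction), so a retrying variant could look roughly like the sketch below; the function name is made up, and it relies on the UNIQUE (link, title) constraint added in the DDL further down:

CREATE OR REPLACE FUNCTION channel_insert_retry(
    p_link  feed_channel.link%TYPE,
    p_title feed_channel.title%TYPE
) RETURNS feed_channel.id%TYPE AS $$
DECLARE
    v_id feed_channel.id%TYPE;
BEGIN
    LOOP
        -- pick up a row committed by another session, if any
        SELECT id
          INTO v_id
          FROM feed_channel
         WHERE link = p_link AND title = p_title;
        IF FOUND THEN
            RETURN v_id;
        END IF;

        BEGIN
            INSERT INTO feed_channel(link, title)
            VALUES (p_link, p_title)
            RETURNING id INTO v_id;
            RETURN v_id;
        EXCEPTION WHEN unique_violation THEN
            -- somebody inserted the same channel concurrently;
            -- loop around and use their row instead
        END;
    END LOOP;
END;
$$ LANGUAGE plpgsql;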
There is also a data-modelling problem here: feed_channel appears to have a lot of candidate keys, and the ON DELETE CASCADE rule on feed_content means that removing one channel row silently removes all of that channel's content rows as well (I suppose channel -> content is a 1::M relation; more than one content row can refer to the same channel).
Finally, the feed_channel table at least needs the natural key {link, title}. That is what the insert-if-not-exists is all about (and the whole purpose of this function).
I cleaned up the function a bit. The IF construct is not needed; doing an INSERT ... WHERE NOT EXISTS first works just as well, and maybe even better.
DROP SCHEMA tmp CASCADE;
CREATE SCHEMA tmp ;
SET search_path=tmp;
CREATE TABLE feed_channel
( id SERIAL PRIMARY KEY
, name TEXT
, link TEXT NOT NULL
, title TEXT NOT NULL -- part of PK :: must be not nullable
, CONSTRAINT feed_channel_nat UNIQUE (link,title) -- the natural key
);
CREATE TABLE feed_content
( id SERIAL PRIMARY KEY
, channel INTEGER REFERENCES feed_channel(id) ON DELETE CASCADE ON UPDATE CASCADE
, guid TEXT UNIQUE NOT NULL -- yet another primary key
, title TEXT --
, link TEXT -- title && link appear to be yet another candidate key
, description TEXT
, pubdate TIMESTAMP
);
-- NOTE: omitted original function channel_insert() for brevity
CREATE OR REPLACE FUNCTION channel_insert_wp(
p_link feed_channel.link%TYPE,
p_title feed_channel.title%TYPE
) RETURNS feed_channel.id%TYPE AS $body$
DECLARE
v_id feed_channel.id%TYPE;
BEGIN
INSERT INTO feed_channel(link,title)
SELECT p_link,p_title
WHERE NOT EXISTS ( SELECT *
FROM feed_channel nx
WHERE nx.link= p_link
AND nx.title= p_title
)
;
SELECT id INTO v_id
FROM feed_channel ex
WHERE ex.link= p_link
AND ex.title= p_title
;
RETURN v_id;
END;
$body$ LANGUAGE plpgsql;
SELECT channel_insert('Bogus_link', 'Bogus_title');
SELECT channel_insert_wp('Bogus_link2', 'Bogus_title2');
SELECT * FROM feed_channel;
Results:
DROP SCHEMA
CREATE SCHEMA
SET
NOTICE: CREATE TABLE will create implicit sequence "feed_channel_id_seq" for serial column "feed_channel.id"
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "feed_channel_pkey" for table "feed_channel"
NOTICE: CREATE TABLE / UNIQUE will create implicit index "feed_channel_nat" for table "feed_channel"
CREATE TABLE
NOTICE: CREATE TABLE will create implicit sequence "feed_content_id_seq" for serial column "feed_content.id"
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "feed_content_pkey" for table "feed_content"
NOTICE: CREATE TABLE / UNIQUE will create implicit index "feed_content_guid_key" for table "feed_content"
CREATE TABLE
NOTICE: type reference feed_channel.link%TYPE converted to text
NOTICE: type reference feed_channel.title%TYPE converted to text
NOTICE: type reference feed_channel.id%TYPE converted to integer
CREATE FUNCTION
NOTICE: type reference feed_channel.link%TYPE converted to text
NOTICE: type reference feed_channel.title%TYPE converted to text
NOTICE: type reference feed_channel.id%TYPE converted to integer
CREATE FUNCTION
channel_insert
----------------
1
(1 row)
channel_insert_wp
-------------------
2
(1 row)
id | name | link | title
----+------+-------------+--------------
1 | | Bogus_link | Bogus_title
2 | | Bogus_link2 | Bogus_title2
(2 rows)
Your foremost problem is that a serial does not make a good primary key for the feed_channel table. The primary key should be (link, title), or just (link) if title can be null. Then any attempt to insert an existing feed would raise a primary key error.
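A sketch of that alternative (keeping the column names from the question; feed_content would then have to reference (link, title) instead of id, which is a bigger change than shown here):

CREATE TABLE feed_channel
(
    name  TEXT,
    link  TEXT NOT NULL,
    title TEXT NOT NULL,          -- primary key columns cannot be null
    PRIMARY KEY (link, title)
);

-- Inserting the same feed twice now raises an error instead of duplicating it:
INSERT INTO feed_channel(link, title) VALUES ('http://example.org/rss', 'Example');
INSERT INTO feed_channel(link, title) VALUES ('http://example.org/rss', 'Example');
-- ERROR:  duplicate key value violates unique constraint "feed_channel_pkey"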
BTW, v_id will be null whenever title is null:
WHERE link=p_link AND title=p_title
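That is because an equality test against NULL is never true, so the lookup simply misses rows whose title is NULL. If null titles must match, the statement inside the function would need a null-safe comparison, something like this fragment:

    -- inside channel_insert(): IS NOT DISTINCT FROM treats two NULLs as equal
    SELECT id
      INTO v_id
      FROM feed_channel
     WHERE link = p_link
       AND title IS NOT DISTINCT FROM p_title
     LIMIT 1;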